2.6 URLs — HTML Standard

This specification defines the term URL, and defines various algorithms for dealing with URLs, because for historical reasons the rules defined by the URI and IRI specifications are not a complete description of what HTML user agents need to implement to be compatible with Web content.

The term "URL" in this specification is used in a manner distinct from the precise technical meaning it is given in RFC 3986. Readers familiar with that RFC will find it easier to read this specification if they pretend the term "URL" as used herein is really called something else altogether. This is a willful violation of RFC 3986. [RFC3986]

2.6.1 Terminology

A string is a valid non-empty URL if it is a valid URL but it is not the empty string.

This specification defines the URL about:legacy-compat as a reserved, though unresolvable, about: URI, for use in DOCTYPEs in HTML documents when needed for compatibility with XML tools. [ABOUT]

2.6.2 Parsing URLs

To parse a URL url into its component parts, the user agent must use the following steps:

Strip leading and trailing whitespace from url.
Parse url in the manner defined by RFC 3986, with the following exceptions:
- Add all characters with code points less than or equal to U+0020 or greater than or equal to U+007F to the <unreserved> production.
- Add the characters U+0022, U+003C, U+003E, U+005B .. U+005E, U+0060, and U+007B .. U+007D to the <unreserved> production.
- Add a single U+0025 PERCENT SIGN character as a second alternative way of matching the <pct-encoded> production, except when the <pct-encoded> is used in the <reg-name> production.
- Add the U+0023 NUMBER SIGN character to the characters allowed in the <fragment> production.
If url doesn't match the <URI-reference> production, even after the above changes are made to the ABNF definitions, then parsing the URL fails with an error. [RFC3986]

Otherwise, parsing url was successful; the components of the URL are substrings of url defined as follows:
<scheme>

The substring matched by the <scheme> production, if any.

<host>

The substring matched by the <host> production, if any.

<port>

The substring matched by the <port> production, if any.

<hostport>

If there is a <scheme> component and a <port> component and the port given by the <port> component is different from the default port defined for the protocol given by the <scheme> component, then <hostport> is the substring that starts with the substring matched by the <host> production and ends with the substring matched by the <port> production, and includes the colon in between the two. Otherwise, it is the same as the <host> component.

<path>

The substring matched by one of the following productions, if one of them was matched:
- <path-abempty>
- <path-absolute>
- <path-noscheme>
- <path-rootless>
- <path-empty>

<query>

The substring matched by the <query> production, if any.

<fragment>

The substring matched by the <fragment> production, if any.

<host-specific>

The substring that follows the substring matched by the <authority> production, or the whole string if the <authority> production wasn't matched.
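
As a rough illustration of the component definitions above, the following Python sketch splits a string with the regular expression from RFC 3986 Appendix B. It is deliberately lenient: it does not implement the ABNF exceptions listed earlier and never reports a parse failure. The function name `parse_components` and the `DEFAULT_PORTS` table are assumptions made for this example and are not part of this specification.

```python
import re

# Component-splitting regular expression from RFC 3986, Appendix B.
_URI_REFERENCE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?", re.DOTALL
)

# Hypothetical default-port table; a real user agent knows many more schemes.
DEFAULT_PORTS = {"http": "80", "https": "443", "ftp": "21"}


def parse_components(url):
    """Split url into the components named in this section (None = absent)."""
    url = url.strip(" \t\n\f\r")  # strip leading and trailing whitespace
    m = _URI_REFERENCE.match(url)
    scheme, authority = m.group(2), m.group(4)
    host = port = hostport = None
    if authority is not None:
        # Drop any userinfo, then split host from port. IPv6 literals in
        # square brackets may themselves contain colons.
        host, port = re.match(
            r"^(\[[^\]]*\]|[^:]*)(?::(.*))?$",
            authority.rpartition("@")[2],
            re.DOTALL,
        ).groups()
        default = DEFAULT_PORTS.get(scheme.lower()) if scheme else None
        if scheme and port is not None and port != default:
            hostport = host + ":" + port
        else:
            hostport = host
    return {
        "scheme": scheme,
        "host": host,
        "port": port,
        "hostport": hostport,
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
        # Everything after the authority, or the whole string without one.
        "host-specific": url[m.end(4):] if authority is not None else url,
    }
```

For example, `parse_components("https://www.example.com:443/")` reports a <port> of `443` but a <hostport> of `www.example.com`, because 443 is the default port assumed here for `https`.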

These parsing rules are a willful violation of RFC 3986 and RFC 3987 (which do not define error handling), motivated by a desire to handle legacy content. [RFC3986] [RFC3987]

2.6.3 Resolving URLs

Resolving a URL is the process of taking a relative URL and obtaining the absolute URL that it implies.

To resolve a URL to an absolute URL relative to either another absolute URL or an element, the user agent must use the following steps. Resolving a URL can result in an error, in which case the URL is not resolvable.

Let url be the URL being resolved.
Let encoding be determined as follows:

If the URL had a character encoding defined when the URL was created or defined

The URL character encoding is as defined.

If the URL came from a script (e.g. as an argument to a method)

The URL character encoding is the script's URL character encoding.

If the URL came from a DOM node (e.g. from an element)

The node has a Document, and the URL character encoding is the document's character encoding.
If encoding is a UTF-16 encoding, then change the value of encoding to UTF-8.
If the algorithm was invoked with an absolute URL to use as the base URL, let base be that absolute URL.

Otherwise, let base be the base URI of the element, as defined by the XML Base specification, with the base URI of the document entity being defined as the document base URL of the Document that owns the element. [XMLBASE]

For the purposes of the XML Base specification, user agents must act as if all Document objects represented XML documents.

It is possible for xml:base attributes to be present even in HTML fragments, as such attributes can be added dynamically using script. (Such scripts would not be conforming, however, as xml:base attributes are not allowed in HTML documents.)

The document base URL of a Document object is the absolute URL obtained by running these substeps:
1. Let fallback base url be the document's address.
2. If fallback base url is about:blank, and the Document's browsing context has a creator browsing context, then let fallback base url be the document base URL of the creator Document instead.
3. If the Document is an iframe srcdoc document, then let fallback base url be the document base URL of the Document's browsing context's browsing context container's Document instead.
4. If there is no base element that has an href attribute, then the document base URL is fallback base url; abort these steps. Otherwise, let url be the value of the href attribute of the first such element.
5. Resolve url relative to fallback base url (thus, the base href attribute isn't affected by xml:base attributes).
6. The document base URL is the result of the previous step if it was successful; otherwise it is fallback base url.
Parse url into its component parts.
If parsing url resulted in a <host> component, then replace the matching substring of url with the string that results from expanding any sequences of percent-encoded octets in that component that are valid UTF-8 sequences into Unicode characters as defined by UTF-8.

If any percent-encoded octets in that component are not valid UTF-8 sequences (e.g. sequences of percent-encoded octets that expand to surrogate code points), then return an error and abort these steps.

Apply the IDNA ToASCII algorithm to the matching substring, with both the AllowUnassigned and UseSTD3ASCIIRules flags set. Replace the matching substring with the result of the ToASCII algorithm.

If ToASCII fails to convert one of the components of the string, e.g. because it is too long or because it contains invalid characters, then return an error and abort these steps. [RFC3490]
If parsing url resulted in a <path> component, then replace the matching substring of url with the string that results from applying the following steps to each character other than U+0025 PERCENT SIGN (%) that doesn't match the original <path> production defined in RFC 3986:
1. Encode the character into a sequence of octets as defined by UTF-8.
2. Replace the character with the percent-encoded form of those octets. [RFC3986]
For instance if url was "//example.com/a^b☺c%FFd%z/?e", then the <path> component's substring would be "/a^b☺c%FFd%z/" and the two characters that would have to be escaped would be "^" and "☺". The result after this step was applied would therefore be that url now had the value "//example.com/a%5Eb%E2%98%BAc%FFd%z/?e".
If parsing url resulted in a <query> component, then replace the matching substring of url with the string that results from applying the following steps to each character other than U+0025 PERCENT SIGN (%) that doesn't match the original <query> production defined in RFC 3986:
1. If the character in question cannot be expressed in the character encoding given by encoding, then replace it with a single 0x3F octet (an ASCII question mark) and skip the remaining substeps for this character.
2. Encode the character into a sequence of octets as defined by the character encoding given by encoding.
3. Replace the character with the percent-encoded form of those octets. [RFC3986]
Apply the algorithm described in RFC 3986 section 5.2 Relative Resolution, using url as the potentially relative URI reference (R), and base as the base URI (Base). [RFC3986]
Apply any relevant conformance criteria of RFC 3986 and RFC 3987, returning an error and aborting these steps if appropriate. [RFC3986] [RFC3987]

For instance, if an absolute URI that would be returned by the above algorithm violates the restrictions specific to its scheme, e.g. a data: URI using the "//" server-based naming authority syntax, then user agents are to treat this as an error instead.
Let result be the target URI (T) returned by the Relative Resolution algorithm.
If result uses a scheme with a server-based naming authority, replace all U+005C REVERSE SOLIDUS (\) characters in result with U+002F SOLIDUS (/) characters.
Return result.
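
The <host>, <path>, and <query> rewriting steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not a complete implementation of the resolving algorithm: the helper names are invented for this example, the components are assumed to have been split out already, and Python's built-in `idna` codec stands in for the ToASCII step even though its handling of the AllowUnassigned and UseSTD3ASCIIRules flags may differ from what is required here.

```python
import string
from urllib.parse import unquote

_UNRESERVED = string.ascii_letters + string.digits + "-._~"
_SUB_DELIMS = "!$&'()*+,;="
# Characters that match the original RFC 3986 <path> and <query> productions.
_PATH_CHARS = set(_UNRESERVED + _SUB_DELIMS + ":@/")
_QUERY_CHARS = _PATH_CHARS | {"?"}


def _percent_encode(octets):
    return "".join("%{:02X}".format(octet) for octet in octets)


def normalize_host(host):
    """Expand percent-encoded UTF-8 in host, then apply IDNA ToASCII."""
    try:
        decoded = unquote(host, encoding="utf-8", errors="strict")
        # Assumption: the built-in codec approximates the ToASCII step.
        return decoded.encode("idna").decode("ascii")
    except UnicodeError as exc:
        raise ValueError("URL is not resolvable") from exc


def escape_path(path):
    """Percent-encode, as UTF-8, every character not valid in <path>."""
    return "".join(
        c if c in _PATH_CHARS or c == "%" else _percent_encode(c.encode("utf-8"))
        for c in path
    )


def escape_query(query, encoding="utf-8"):
    """Percent-encode, in the given encoding, characters not valid in <query>."""
    if encoding.lower().replace("_", "-").startswith("utf-16"):
        encoding = "utf-8"  # UTF-16 encodings are replaced by UTF-8
    out = []
    for c in query:
        if c in _QUERY_CHARS or c == "%":
            out.append(c)
            continue
        try:
            out.append(_percent_encode(c.encode(encoding)))
        except UnicodeEncodeError:
            out.append("?")  # a single 0x3F octet for inexpressible characters
    return "".join(out)
```

Applied to the path from the example above, `escape_path("/a^b☺c%FFd%z/")` gives `/a%5Eb%E2%98%BAc%FFd%z/`: only "^" and "☺" are escaped, and existing percent signs are left untouched.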

Some of the steps in these rules, for example the processing of U+005C REVERSE SOLIDUS (\) characters, are a willful violation of RFC 3986 and RFC 3987, motivated by a desire to handle legacy content. [RFC3986] [RFC3987]
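
The substeps that define the document base URL earlier in this section can be sketched as follows. The function and parameter names are hypothetical, the caller is assumed to supply the relevant document state, and `urllib.parse.urljoin` is used as a stand-in for the resolving algorithm of this section, which it only approximates.

```python
from urllib.parse import urljoin, urlsplit


def document_base_url(address, base_href=None, creator_base=None,
                      srcdoc_container_base=None):
    """Return the document base URL for a document.

    address: the document's address.
    base_href: href of the first base element that has one, else None.
    creator_base: document base URL of the creator Document, if any.
    srcdoc_container_base: document base URL of the container's Document
        when this is an iframe srcdoc document, else None.
    """
    fallback = address
    if fallback == "about:blank" and creator_base is not None:
        fallback = creator_base
    if srcdoc_container_base is not None:
        fallback = srcdoc_container_base
    if base_href is None:
        return fallback
    # Resolve against the fallback only, so xml:base has no influence here.
    resolved = urljoin(fallback, base_href)
    # Assumption: a result without a scheme counts as a failed resolution.
    if not urlsplit(resolved).scheme:
        return fallback
    return resolved
```

For instance, `document_base_url("http://example.com/dir/page.html", base_href="/assets/")` yields `http://example.com/assets/`, while the same call without `base_href` returns the document's address unchanged.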

A URL is an absolute URL if resolving it results in the same output regardless of what it is resolved relative to, and that output is not a failure.

An absolute URL is a hierarchical URL if, when resolved and then parsed, there is a character immediately after the <scheme> component and it is a U+002F SOLIDUS character (/).

An absolute URL is an authority-based URL if, when resolved and then parsed, there are two characters immediately after the <scheme> component and they are both U+002F SOLIDUS characters (//).
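
A minimal sketch of the last two definitions, assuming the input is already a resolved absolute URL and reading "immediately after the <scheme> component" as "immediately after the colon that ends the scheme" (an interpretation made for this example). The function names are hypothetical.

```python
import re

# A scheme as defined by RFC 3986, followed by its terminating colon.
_SCHEME_PREFIX = re.compile(r"^[A-Za-z][A-Za-z0-9+.\-]*:")


def is_hierarchical(absolute_url):
    """True if a single "/" follows the scheme's colon."""
    m = _SCHEME_PREFIX.match(absolute_url)
    return bool(m) and absolute_url[m.end():m.end() + 1] == "/"


def is_authority_based(absolute_url):
    """True if "//" follows the scheme's colon."""
    m = _SCHEME_PREFIX.match(absolute_url)
    return bool(m) and absolute_url[m.end():m.end() + 2] == "//"
```

Under this reading, `http://example.com/` is both hierarchical and authority-based, `file:/tmp/x` is hierarchical only, and `mailto:user@example.com` is neither.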

2.6.4 URL manipulation and creation

To fragment-escape a string input, a user agent must run the following steps:

Let input be the string to be escaped.
Let position point at the first character of input.
Let output be an empty string.
Loop: If position is past the end of input, then jump to the step labeled end.
If the character in input pointed to by position is in the range U+0000 to U+0020 or is one of the following characters:
- U+0022 QUOTATION MARK character (")
- U+0023 NUMBER SIGN character (#)
- U+0025 PERCENT SIGN character (%)
- U+003C LESS-THAN SIGN character (<)
- U+003E GREATER-THAN SIGN character (>)
- U+005B LEFT SQUARE BRACKET character ([)
- U+005C REVERSE SOLIDUS character (\)
- U+005D RIGHT SQUARE BRACKET character (])
- U+005E CIRCUMFLEX ACCENT character (^)
- U+007B LEFT CURLY BRACKET character ({)
- U+007C VERTICAL LINE character (|)
- U+007D RIGHT CURLY BRACKET character (})
...then append the percent-encoded form of the character to output. [RFC3986]

Otherwise, append the character itself to output.

This escapes any ASCII characters that are not valid in the URI <fragment> production without being escaped.
Advance position to the next character in input.
Return to the step labeled loop.
End: Return output.
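
The fragment-escape steps above translate fairly directly into code. The following Python sketch replaces the explicit loop and position pointer with iteration over the string; the function name `fragment_escape` is chosen for this example.

```python
# The individually listed characters that must be escaped.
_MUST_ESCAPE = set('"#%<>[\\]^{|}')


def fragment_escape(input_string):
    """Percent-encode the ASCII characters not valid unescaped in <fragment>."""
    output = []
    for ch in input_string:
        if ord(ch) <= 0x20 or ch in _MUST_ESCAPE:
            # Every affected character is ASCII, so it encodes to one octet.
            output.append("%{:02X}".format(ord(ch)))
        else:
            output.append(ch)
    return "".join(output)
```

For example, `fragment_escape("a b#c")` returns `a%20b%23c`. Non-ASCII characters pass through unchanged, since the steps only name characters in the ASCII range.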

2.6.5 Dynamic changes to base URLs

When an xml:base attribute changes, the attribute's element, and all descendant elements, are affected by a base URL change.

When a document's document base URL changes, all elements in that document are affected by a base URL change.

The following are base URL change steps, which run when an element is affected by a base URL change (as defined by the DOM Core specification):

If the element creates a hyperlink

If the absolute URL identified by the hyperlink is being shown to the user, or if any data derived from that URL is affecting the display, then the href attribute should be re-resolved relative to the element and the UI updated appropriately.

For example, the CSS :link/:visited pseudo-classes might have been affected.

If the hyperlink has a ping attribute and its absolute URL(s) are being shown to the user, then the ping attribute's tokens should be re-resolved relative to the element and the UI updated appropriately.

If the element is a q, blockquote, ins, or del element with a cite attribute

If the absolute URL identified by the cite attribute is being shown to the user, or if any data derived from that URL is affecting the display, then the URL should be re-resolved relative to the element and the UI updated appropriately.

Otherwise

The element is not directly affected.

For instance, changing the base URL doesn't affect the image displayed by img elements, although subsequent accesses of the src IDL attribute from script will return a new absolute URL that might no longer correspond to the image being shown.

2.6.6 Interfaces for URL manipulation

An interface that has a complement of URL decomposition IDL attributes has seven attributes, named `protocol`, `host`, `hostname`, `port`, `pathname`, `search`, and `hash`.

The attributes defined to be URL decomposition IDL attributes must act as described for the attributes with the same corresponding names in this section.

In addition, an interface with a complement of URL decomposition IDL attributes defines an input, which is a URL that the attributes act on, and a common setter action, which is a set of steps invoked when any of the attributes' setters are invoked.

The seven URL decomposition IDL attributes have similar requirements.

On getting, if the input is an absolute URL that fulfills the condition given in the "getter condition" column corresponding to the attribute in the table below, the user agent must return the part of the input URL given in the "component" column, with any prefixes specified in the "prefix" column appropriately added to the start of the string and any suffixes specified in the "suffix" column appropriately added to the end of the string. Otherwise, the attribute must return the empty string.

On setting, the new value must first be mutated as described by the "setter preprocessor" column, then mutated by %-escaping any characters in the new value that are not valid in the relevant component as given by the "component" column. Then, if the input is an absolute URL and the resulting new value fulfills the condition given in the "setter condition" column, the user agent must make a new string output by replacing the component of the URL given by the "component" column in the input URL with the new value; otherwise, the user agent must let output be equal to the input. Finally, the user agent must invoke the common setter action with the value of output.

When replacing a component in the URL, if the component is part of an optional group in the URL syntax consisting of a character followed by the component, the component (including its prefix character) must be included even if the new value is the empty string.

The previous paragraph applies in particular to the ":" before a <port> component, the "?" before a <query> component, and the "#" before a <fragment> component.

For the purposes of the above definitions, URLs must be parsed using the URL parsing rules defined in this specification.

Attribute	Component	Getter Condition	Prefix	Suffix	Setter Preprocessor	Setter Condition
`protocol`	<scheme>	—	—	U+003A COLON (:)	Remove all trailing U+003A COLON characters (:)	The new value is not the empty string
`host`	<hostport>	input is an authority-based URL	—	—	—	The new value is not the empty string and input is an authority-based URL
`hostname`	<host>	input is an authority-based URL	—	—	Remove all leading U+002F SOLIDUS characters (/)	The new value is not the empty string and input is an authority-based URL
`port`	<port>	input is an authority-based URL, and contained a <port> component (possibly an empty one)	—	—	Remove all characters in the new value from the first that is not in the range U+0030 DIGIT ZERO (0) to U+0039 DIGIT NINE (9), if any. Remove any leading U+0030 DIGIT ZERO characters (0) in the new value. If the resulting string is empty, set it to a single U+0030 DIGIT ZERO character (0).	input is an authority-based URL, and the new value, when interpreted as a base-ten integer, is less than or equal to 65535
`pathname`	<path>	input is a hierarchical URL	—	—	If it has no leading U+002F SOLIDUS character (/), prepend a U+002F SOLIDUS character (/) to the new value	input is hierarchical
`search`	<query>	input is a hierarchical URL, and contained a <query> component (possibly an empty one)	U+003F QUESTION MARK (?)	—	Remove one leading U+003F QUESTION MARK character (?), if any	input is a hierarchical URL
`hash`	<fragment>	input contained a non-empty <fragment> component	U+0023 NUMBER SIGN (#)	—	Remove one leading U+0023 NUMBER SIGN character (#), if any	—
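
The "setter preprocessor" column of the table above can be illustrated with a short Python sketch. It covers only the preprocessing step and the numeric part of the `port` setter condition, not the subsequent %-escaping or component replacement; the function names are assumptions made for this example.

```python
import re


def preprocess(attribute, value):
    """Apply the setter preprocessor for the named attribute to value."""
    if attribute == "protocol":
        return value.rstrip(":")  # remove all trailing colons
    if attribute == "hostname":
        return value.lstrip("/")  # remove all leading slashes
    if attribute == "port":
        # Keep the leading run of ASCII digits, drop leading zeros,
        # and fall back to "0" if nothing is left.
        digits = re.match(r"[0-9]*", value).group(0)
        return digits.lstrip("0") or "0"
    if attribute == "pathname":
        return value if value.startswith("/") else "/" + value
    if attribute == "search":
        return value[1:] if value.startswith("?") else value
    if attribute == "hash":
        return value[1:] if value.startswith("#") else value
    return value  # host has no preprocessor


def port_in_range(preprocessed_port):
    """Numeric part of the port setter condition."""
    return int(preprocessed_port) <= 65535
```

For example, `preprocess("port", "0080abc")` returns `80`, and `preprocess("port", "abc")` returns `0`.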

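The table below shows how the getter condition for `search` yields different results depending on the exact syntax of the input URL:

Input URL	`search` value	Explanation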
`http://example.com/`	empty string	No <query> component in input URL.
`http://example.com/?`	`?`	There is a <query> component, but it is empty. The question mark in the resulting value is the prefix.
`http://example.com/?test`	`?test`	The <query> component has the value "`test`".
`http://example.com/?test#`	`?test`	The (empty) <fragment> component is not part of the <query> component.
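
The rows above can be reproduced with a small Python sketch of the `search` and `hash` getters. It assumes the input is already an absolute URL, splits it with the regular expression from RFC 3986 Appendix B rather than the full parsing rules of this specification, and treats a URL as hierarchical when a "/" follows the scheme's colon. The function names are hypothetical.

```python
import re

# Component-splitting regular expression from RFC 3986, Appendix B.
_URI_REFERENCE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?", re.DOTALL
)


def search_getter(input_url):
    """Return "?" plus the <query>, or "" if the getter condition fails."""
    m = _URI_REFERENCE.match(input_url)
    scheme, query = m.group(2), m.group(7)
    hierarchical = (
        scheme is not None
        and input_url[len(scheme) + 1:len(scheme) + 2] == "/"
    )
    # query is None when there is no "?", and "" when the query is empty.
    if hierarchical and query is not None:
        return "?" + query
    return ""


def hash_getter(input_url):
    """Return "#" plus the <fragment>, or "" if the fragment is absent or empty."""
    fragment = _URI_REFERENCE.match(input_url).group(9)
    return "#" + fragment if fragment else ""
```

The distinction between an absent component (`None`) and an empty one (`""`) is what separates the first two rows of the table.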

The following table is similar; it provides a list of what each of the URL decomposition IDL attributes returns for a given input URL.

Input	`protocol`	`host`	`hostname`	`port`	`pathname`	`search`	`hash`
`http://example.com/carrot#question%3f`	`http:`	`example.com`	`example.com`	(empty string)	`/carrot`	(empty string)	`#question%3f`
`https://www.example.com:4443?`	`https:`	`www.example.com:4443`	`www.example.com`	`4443`	`/`	`?`	(empty string)