
URI normalization: Difference between revisions

From Wikipedia, the free encyclopedia
Tag: nowiki added
 
(25 intermediate revisions by 10 users not shown)
Line 1: Line 1:
{{short description|Process by which URLs are standardized}}
{{short description|Process by which URIs are standardized}}
{{Use American English|date=March 2021}}
{{Use mdy dates|date=March 2021}}
{{Distinguish|URL canonicalization}}
{{Distinguish|URL canonicalization}}
[[File:Normalization URL Animation.gif|thumb|right|300px|Types of URL normalization.]]
[[File:Normalization URL Animation.gif|thumb|right|300px|Types of URI normalization.]]
'''URL normalization''' is the process by which [[Uniform Resource Locator|URLs]] are modified and standardized in a consistent manner. The goal of the normalization process is to transform a URL into a normalized URL so it is possible to determine if two syntactically different URLs may be equivalent.
'''URI normalization''' is the process by which [[Uniform Resource Identifier|URIs]] are modified and standardized in a consistent manner. The goal of the normalization process is to transform a URI into a normalized URI so it is possible to determine if two syntactically different URIs may be equivalent.


[[Search engine]]s employ URL normalization in order to {{clarify span|assign importance to web pages|date=April 2014}} and to reduce indexing of duplicate pages. [[Web crawler]]s perform URL normalization in order to avoid crawling the same resource more than once. [[Web browsers]] may perform normalization to determine if a link has been visited or to determine if a [[Web cache|page has been cached]].
[[Search engine]]s employ URI normalization in order to correctly rank pages that may be found with multiple URIs, and to reduce indexing of duplicate pages. [[Web crawler]]s perform URI normalization in order to avoid crawling the same resource more than once. [[Web browser]]s may perform normalization to determine if a link has been visited or to determine if a [[Web cache|page has been cached]]. [[Web server]]s may also perform normalization for many reasons (i.e. to be able to more easily intercept security risks coming from client requests, to use only one absolute file name for each resource stored in their caches, named in log files, etc.).


==Normalization process==
==Normalization process==
Line 10: Line 12:


===Normalizations that preserve semantics===
===Normalizations that preserve semantics===
The following normalizations are described in RFC 3986 <ref>[https://tools.ietf.org/html/rfc3986#section-6 RFC 3986, Section 6. Normalization and Comparison]</ref> to result in equivalent URLs:
The following normalizations are described in RFC 3986 <ref>[https://tools.ietf.org/html/rfc3986#section-6 RFC 3986, Section 6. Normalization and Comparison]</ref> to result in equivalent URIs:
* '''Converting percent-encoded triplets to uppercase.''' All letters within a [[percent-encoding]] triplet (e.g., "%3A") are case-insensitive, and should be capitalized.<ref>[https://tools.ietf.org/html/rfc3986#section-6.2.2.1 RFC 3986, Section 6.2.2.1. Case Normalization]</ref> Example:
* '''Converting percent-encoded triplets to uppercase.''' The hexadecimal digits within a [[percent-encoding]] triplet of the URI (e.g., <code>%3a</code> versus <code>%3A</code>) are [[case sensitivity|case-insensitive]] and therefore should be normalized to use uppercase letters for the digits A-F.<ref>[https://tools.ietf.org/html/rfc3986#section-6.2.2.1 RFC 3986, Section 6.2.2.1. Case Normalization]</ref> Example:
:<code><nowiki>http://www.example.com/a%c2%b1b</nowiki></code> → <code><nowiki>http://www.example.com/a%C2%B1b</nowiki></code>
:<code><nowiki>http://example.com/foo%2a</nowiki></code> → <code><nowiki>http://example.com/foo%2A</nowiki></code>
* '''Converting the scheme and host to lowercase.''' The [[URI scheme|scheme]] and [[Host (network)|host]] components of the URI are [[case sensitivity|case-insensitive]]. Most normalizers will convert them to lowercase.<ref>[https://tools.ietf.org/html/rfc3986#section-6.2.2.1 RFC 3986, Section 6.2.2.1. Case Normalization]</ref> Example:
* '''Converting the scheme and host to lowercase.''' The [[URI scheme|scheme]] and [[Host (network)|host]] components of the URI are case-insensitive and therefore should be normalized to lowercase.<ref>[https://tools.ietf.org/html/rfc3986#section-6.2.2.1 RFC 3986, Section 6.2.2.1. Case Normalization]</ref> Example:
:<code><nowiki>HTTP://www.Example.com/</nowiki></code> → <code><nowiki>http://www.example.com/</nowiki></code>
:<code><nowiki>HTTP://User@Example.COM/Foo</nowiki></code> → <code><nowiki>http://User@example.com/Foo</nowiki></code>
* '''Decoding percent-encoded triplets of unreserved characters.''' For consistency, percent-encoded octets in the ranges of ''ALPHA'' (<code>%41</code>–<code>%5A</code> and <code>%61</code>–<code>%7A</code>), ''DIGIT'' (<code>%30</code>–<code>%39</code>), hyphen (<code>%2D</code>), period (<code>%2E</code>), underscore (<code>%5F</code>), or tilde (<code>%7E</code>) should not be created by URI producers and, when found in a URI, should be decoded to their corresponding unreserved characters by URI normalizers.<ref>[https://tools.ietf.org/html/rfc3986#section-6.2.2.2 RFC 3986, Section 6.2.2.3. Path Segment Normalization]</ref> Example:
* '''Decoding percent-encoded triplets of unreserved characters.''' Percent-encoded triplets of the URI in the ranges of ''ALPHA'' (<code>%41</code>–<code>%5A</code> and <code>%61</code>–<code>%7A</code>), ''DIGIT'' (<code>%30</code>–<code>%39</code>), hyphen (<code>%2D</code>), period (<code>%2E</code>), underscore (<code>%5F</code>), or tilde (<code>%7E</code>) do not require percent-encoding and should be decoded to their corresponding unreserved characters.<ref>[https://tools.ietf.org/html/rfc3986#section-6.2.2.2 RFC 3986, Section 6.2.2.3. Path Segment Normalization]</ref> Example:
:<code><nowiki>http://www.example.com/%7Eusername/</nowiki></code> → <code><nowiki>http://www.example.com/~username/</nowiki></code>
:<code><nowiki>http://example.com/%7Efoo</nowiki></code> → <code><nowiki>http://example.com/~foo</nowiki></code>
* '''Converting an empty path to a "/" path.''' In general, a URI that uses the generic syntax for authority with an empty path should be normalized to a path of "/".<ref>[https://tools.ietf.org/html/rfc3986#section-6.2.3 RFC 3986, Section 6.2.3. Scheme-Based Normalization]</ref> Example:
* '''Removing dot-segments.''' Dot-segments <code>.</code> and <code>..</code> in the path component of the URI should be removed by applying the remove_dot_segments algorithm<ref>[https://tools.ietf.org/html/rfc3986#section-5.2.4 RFC 3986, 5.2.4. Remove Dot Segments]</ref> to the path described in RFC 3986.<ref>[https://tools.ietf.org/html/rfc3986#section-6.2.2.3 RFC 3986, 6.2.2.3. Path Segment Normalization]</ref> Example:
:<code><nowiki>http://www.example.com</code> → <code><nowiki>http://www.example.com/</nowiki></code>
:<code><nowiki>http://example.com/foo/./bar/baz/../qux</nowiki></code> → <code><nowiki>http://example.com/foo/bar/qux</nowiki></code>
* '''Removing the default port.''' The [[List of TCP and UDP port numbers|default port]] (port 80 for the “http” scheme) should be removed the URI.<ref>[https://tools.ietf.org/html/rfc3986#section-6.2.3 RFC 3986, Section 6.2.3. Scheme-Based Normalization]</ref> Example:
* '''Converting an empty path to a "/" path.''' In presence of an authority component, an empty path component should be normalized to a path component of "/".<ref>[https://tools.ietf.org/html/rfc3986#section-6.2.3 RFC 3986, Section 6.2.3. Scheme-Based Normalization]</ref> Example:
:<code><nowiki>http://www.example.com:80/bar.html</nowiki></code> → <code><nowiki>http://www.example.com/bar.html</nowiki></code>
:<code><nowiki>http://example.com</nowiki></code> → <code><nowiki>http://example.com/</nowiki></code>
* '''Removing the default port.''' An empty or [[List of TCP and UDP port numbers|default port]] component of the URI (port 80 for the <code>http</code> scheme) with its ":" delimiter should be removed.<ref>[https://tools.ietf.org/html/rfc3986#section-6.2.3 RFC 3986, Section 6.2.3. Scheme-Based Normalization]</ref> Example:
:<code><nowiki>http://example.com:80/</nowiki></code> → <code><nowiki>http://example.com/</nowiki></code>


===Normalizations that usually preserve semantics===
===Normalizations that usually preserve semantics===
For http and https URLs, the following normalizations listed in RFC 3986 may result in equivalent URLs, but are not guaranteed to by the standards:
For http and https URIs, the following normalizations listed in RFC 3986 may result in equivalent URIs, but are not guaranteed to by the standards:
* '''Adding trailing /''' Directories (folders) are indicated with a trailing slash and should be included in URLs. Example:
* '''Adding a trailing "/" to a non-empty path.''' Directories (folders) are indicated with a trailing slash and should be included in URIs. Example:
:<code><nowiki>http://www.example.com/alice</nowiki></code> → <code><nowiki>http://www.example.com/alice/</nowiki></code>
:<code><nowiki>http://example.com/foo</nowiki></code> → <code><nowiki>http://example.com/foo/</nowiki></code>
:However, there is no way to know if a URL path component represents a directory or not. RFC 3986 notes that if the former URL redirects to the latter URL, then that is an indication that they are equivalent.
:However, there is no way to know if a URI path component represents a directory or not. RFC 3986 notes that if the former URI redirects to the latter URI, then that is an indication that they are equivalent.
* '''Removing dot-segments.''' The segments “..” and “.” can be removed from a URL according to the [[algorithm]] described in RFC 3986 (or a similar algorithm). Example:
:<code><nowiki>http://www.example.com/../a/b/../c/./d.html</nowiki></code> → <code><nowiki>http://www.example.com/a/c/d.html</nowiki></code>
:However, if a removed "<code>..</code>" component, e.g. "<code>b/..</code>", is a [[symlink]] to a directory with a different parent, eliding "<code>b/..</code>" will result in a different path and URL.<ref>{{cite web|url=https://www.securecoding.cert.org/confluence/download/attachments/26017980/08+File+System+Vulnerabilities.pdf |title=Secure Coding in C and C++ |publisher=Securecoding.cert.org |accessdate=2013-08-24}}</ref> In rare cases depending on the web server, this may even be true for the root directory (e.g. "<code>//www.example.com/..</code>" may not be equivalent to "<code>//www.example.com/</code>".


===Normalizations that change semantics===
===Normalizations that change semantics===
Applying the following normalizations result in a semantically different URL although it may refer to the same resource:
Applying the following normalizations result in a semantically different URI although it may refer to the same resource:
* '''Removing directory index.''' Default [[Webserver directory index|directory indexes]] are generally not needed in URLs. Examples:
* '''Removing directory index.''' Default [[Webserver directory index|directory indexes]] are generally not needed in URIs. Examples:
:<code><nowiki>http://www.example.com/default.asp</nowiki></code> → <code><nowiki>http://www.example.com/</nowiki></code>
:<code><nowiki>http://example.com/a/index.html</nowiki></code> → <code><nowiki>http://example.com/a/</nowiki></code>
:<code><nowiki>http://www.example.com/a/index.html</nowiki></code> → <code><nowiki>http://www.example.com/a/</nowiki></code>
:<code><nowiki>http://example.com/default.asp</nowiki></code> → <code><nowiki>http://example.com/</nowiki></code>
* '''Removing the fragment.''' The [[Fragment identifier|fragment]] component of a URL is never seen by the server and can sometimes be removed. Example:
* '''Removing the fragment.''' The [[Fragment identifier|fragment]] component of a URI is never seen by the server and can sometimes be removed. Example:
:<code><nowiki>http://www.example.com/bar.html#section1</nowiki></code> → <code><nowiki>http://www.example.com/bar.html</nowiki></code>
:<code><nowiki>http://example.com/bar.html#section1</nowiki></code> → <code><nowiki>http://example.com/bar.html</nowiki></code>
:However, [[Ajax (programming)|AJAX]] applications frequently use the value in the fragment.
:However, [[Ajax (programming)|AJAX]] applications frequently use the value in the fragment.
* '''Replacing IP with domain name.''' Check if the [[IP address]] maps to a domain name. Example:
* '''Replacing IP with domain name.''' Check if the [[IP address]] maps to a domain name. Example:
:<code><nowiki>http://208.77.188.166/</nowiki></code> → <code><nowiki>http://www.example.com/</nowiki></code>
:<code><nowiki>http://208.77.188.166/</nowiki></code> → <code><nowiki>http://example.com/</nowiki></code>
:The reverse replacement is rarely safe due to [[Shared web hosting service|virtual web servers]].
:The reverse replacement is rarely safe due to [[Shared web hosting service|virtual web servers]].
* '''Limiting protocols.''' Limiting different [[application layer]] protocols. For example, the “https” scheme could be replaced with “http”. Example:
* '''Limiting protocols.''' Limiting different [[application layer]] protocols. For example, the “https” scheme could be replaced with “http”. Example:
:<code><nowiki>https://www.example.com/</nowiki></code> → <code><nowiki>http://www.example.com/</nowiki></code>
:<code><nowiki>https://example.com/</nowiki></code> → <code><nowiki>http://example.com/</nowiki></code>
* '''Removing duplicate slashes''' Paths which include two adjacent slashes could be converted to one. Example:
* '''Removing duplicate slashes''' Paths which include two adjacent slashes could be converted to one. Example:
:<code><nowiki>http://www.example.com/foo//bar.html</nowiki></code> → <code><nowiki>http://www.example.com/foo/bar.html</nowiki></code>
:<code><nowiki>http://example.com/foo//bar.html</nowiki></code> → <code><nowiki>http://example.com/foo/bar.html</nowiki></code>
* '''Removing or adding “www” as the first domain label.''' Some websites operate identically in two Internet domains: one whose least significant label is “www” and another whose name is the result of omitting the least significant label from the name of the first, the latter being known as a [[naked domain]]. For example, <code><nowiki>http://example.com/</nowiki></code> and <code><nowiki>http://www.example.com/</nowiki></code> may access the same website. Many websites [[URL redirection|redirect]] the user from the [[WWW prefix|www]] to the non-www address or vice versa. A normalizer may determine if one of these URLs redirects to the other and normalize all URLs appropriately. Example:
* '''Removing or adding “www” as the first domain label.''' Some websites operate identically in two Internet domains: one whose least significant label is “www” and another whose name is the result of omitting the least significant label from the name of the first, the latter being known as a [[naked domain]]. For example, <code><nowiki>http://www.example.com/</nowiki></code> and <code><nowiki>http://example.com/</nowiki></code> may access the same website. Many websites [[URL redirection|redirect]] the user from the [[WWW prefix|www]] to the non-www address or vice versa. A normalizer may determine if one of these URIs redirects to the other and normalize all URIs appropriately. Example:
:<code><nowiki>http://www.example.com/</nowiki></code> → <code><nowiki>http://example.com/</nowiki></code>
:<code><nowiki>http://www.example.com/</nowiki></code> → <code><nowiki>http://example.com/</nowiki></code>
* '''Sorting the query parameters.''' Some web pages use more than one [[Query string#Web forms|query parameter]] in the URL. A normalizer can sort the parameters into alphabetical order (with their values), and reassemble the URL. Example:
* '''Sorting the query parameters.''' Some web pages use more than one [[Query string#Web forms|query parameter]] in the URI. A normalizer can sort the parameters into alphabetical order (with their values), and reassemble the URI. Example:
:<code><nowiki>http://www.example.com/display?lang=en&article=fred</nowiki></code> → <code><nowiki>http://www.example.com/display?article=fred&lang=en</nowiki></code>
:<code><nowiki>http://example.com/display?lang=en&article=fred</nowiki></code> → <code><nowiki>http://example.com/display?article=fred&lang=en</nowiki></code>
:However, the order of parameters in a URL may be significant (this is not defined by the standard) and a web server may allow the same variable to appear multiple times.<ref>{{cite web|url=http://benalman.com/news/2009/12/jquery-14-param-demystified/ |title=jQuery 1.4 $.param demystified |publisher=Ben Alman |date=2009-12-20 |accessdate=2013-08-24}}</ref>
:However, the order of parameters in a URI may be significant (this is not defined by the standard) and a web server may allow the same variable to appear multiple times.<ref>{{cite web|url=http://benalman.com/news/2009/12/jquery-14-param-demystified/ |title=jQuery 1.4 $.param demystified |publisher=Ben Alman |date=2009-12-20 |access-date=2013-08-24}}</ref>
* '''Removing unused query variables.''' A page may only expect certain parameters to appear in the query; unused parameters can be removed. Example:
* '''Removing unused query variables.''' A page may only expect certain parameters to appear in the query; unused parameters can be removed. Example:
:<code><nowiki>http://www.example.com/display?id=123&fakefoo=fakebar</nowiki></code> → <code><nowiki>http://www.example.com/display?id=123</nowiki></code>
:<code><nowiki>http://example.com/display?id=123&fakefoo=fakebar</nowiki></code> → <code><nowiki>http://example.com/display?id=123</nowiki></code>
:Note that a parameter without a value is not necessarily an unused parameter.
:Note that a parameter without a value is not necessarily an unused parameter.
* '''Removing default query parameters.''' A default value in the query string may render identically whether it is there or not. Example:
* '''Removing default query parameters.''' A default value in the query string may render identically whether it is there or not. Example:
:<code><nowiki>http://www.example.com/display?id=&sort=ascending</nowiki></code> → <code><nowiki>http://www.example.com/display</nowiki></code>
:<code><nowiki>http://example.com/display?id=&sort=ascending</nowiki></code> → <code><nowiki>http://example.com/display</nowiki></code>
* '''Removing the "?" when the query is empty.''' When the query is empty, there may be no need for the "?". Example:
* '''Removing the "?" when the query is empty.''' When the query is empty, there may be no need for the "?". Example:
:<code><nowiki>http://www.example.com/display?</nowiki></code> → <code><nowiki>http://www.example.com/display</nowiki></code>
:<code><nowiki>http://example.com/display?</nowiki></code> → <code><nowiki>http://example.com/display</nowiki></code>


==Normalization based on URL lists==
==Normalization based on URI lists==
Some normalization rules may be developed for specific websites by examining URL lists obtained from previous crawls or web server logs. For example, if the URL
Some normalization rules may be developed for specific websites by examining URI lists obtained from previous crawls or web server logs. For example, if the URI
:<code><nowiki>http://example.com/story?id=xyz</nowiki></code>
:<code><nowiki>http://example.com/story?id=xyz</nowiki></code>
appears in a crawl log several times along with
appears in a crawl log several times along with
:<code><nowiki>http://example.com/story_xyz</nowiki></code>
:<code><nowiki>http://example.com/story_xyz</nowiki></code>
we may assume that the two URLs are equivalent and can be normalized to one of the URL forms.
we may assume that the two URIs are equivalent and can be normalized to one of the URI forms.


Schonfeld et al. (2006) present a heuristic called DustBuster for detecting DUST (different URLs with similar text) rules that can be applied to URL lists. They showed that once the correct DUST rules were found and applied with a normalization algorithm, they were able to find up to 68% of the redundant URLs in a URL list.
Schonfeld et al. (2006) present a heuristic called DustBuster for detecting DUST (different URIs with similar text) rules that can be applied to URI lists. They showed that once the correct DUST rules were found and applied with a normalization algorithm, they were able to find up to 68% of the redundant URIs in a URI list.


==See also==
==See also==
* [[Uniform Resource Locator]]
* [[URL]] (Uniform Resource Locator)
* [[Fragment identifier]]
* [[URI fragment]]
* [[Web crawler]]
* [[Web crawler]]


Line 76: Line 77:
{{Reflist}}
{{Reflist}}
* RFC 3986 - Uniform Resource Identifier (URI): Generic Syntax
* RFC 3986 - Uniform Resource Identifier (URI): Generic Syntax
* {{cite conference | author1 = Sang Ho Lee | author2 = Sung Jin Kim | author3 = Seok Hoo Hong | last-author-amp = yes | year = 2005 | title = On URL normalization | conference = Proceedings of the International Conference on Computational Science and its Applications (ICCSA 2005) | pages = 1076–1085 | url = http://dblab.ssu.ac.kr/publication/LeKi05a.pdf | deadurl = yes | archiveurl = https://web.archive.org/web/20060918115757/http://dblab.ssu.ac.kr/publication/LeKi05a.pdf | archivedate = 2006-09-18 | df = }}
* {{cite conference | author1 = Sang Ho Lee | author2 = Sung Jin Kim | author3 = Seok Hoo Hong | name-list-style = amp | year = 2005 | title = On URL normalization | conference = Proceedings of the International Conference on Computational Science and its Applications (ICCSA 2005) | pages = 1076–1085 | url = http://dblab.ssu.ac.kr/publication/LeKi05a.pdf | url-status = dead | archive-url = https://web.archive.org/web/20060918115757/http://dblab.ssu.ac.kr/publication/LeKi05a.pdf | archive-date = 2006-09-18 }}
* {{cite conference |author1=Uri Schonfeld |author2=Ziv Bar-Yossef |author3=Idit Keidar |last-author-amp=yes | year = 2006 | title = Do not crawl in the dust: different URLs with similar text | conference = Proceedings of the 15th international conference on [[World Wide Web]] | pages = 1015–1016 | url = http://www2006.org/programme/item.php?id=p20}}
* {{cite conference |author1=Uri Schonfeld |author2=Ziv Bar-Yossef |author3=Idit Keidar |author3-link=Idit Keidar |name-list-style=amp | year = 2006 | title = Do not crawl in the dust: different URLs with similar text | conference = Proceedings of the 15th international conference on [[World Wide Web]] | pages = 1015–1016 | url = http://www2006.org/programme/item.php?id=p20}}
* {{cite conference |author1=Uri Schonfeld |author2=Ziv Bar-Yossef |author3=Idit Keidar |last-author-amp=yes | year = 2007 | title = Do not crawl in the dust: different URLs with similar text | conference = Proceedings of the 16th international conference on World Wide Web | pages = 111–120 | url = http://www2007.org/paper194.php}}
* {{cite conference |author1=Uri Schonfeld |author2=Ziv Bar-Yossef |author3=Idit Keidar |author4-link=Idit Keidar |name-list-style=amp | year = 2007 | title = Do not crawl in the dust: different URLs with similar text | conference = Proceedings of the 16th international conference on World Wide Web | pages = 111–120 | url = http://www2007.org/paper194.php}}


[[Category:URL]]
[[Category:URL]]

Latest revision as of 07:23, 16 January 2024

Types of URI normalization.

URI normalization is the process by which URIs are modified and standardized in a consistent manner. The goal of the normalization process is to transform a URI into a normalized URI so it is possible to determine if two syntactically different URIs may be equivalent.

Search engines employ URI normalization in order to correctly rank pages that may be found with multiple URIs, and to reduce indexing of duplicate pages. Web crawlers perform URI normalization in order to avoid crawling the same resource more than once. Web browsers may perform normalization to determine if a link has been visited or to determine if a page has been cached. Web servers may also perform normalization for many reasons (e.g., to more easily intercept security risks coming from client requests, or to use a single absolute file name for each resource stored in their caches or named in log files).

Normalization process

There are several types of normalization that may be performed. Some of them always preserve semantics, while others may not.

Normalizations that preserve semantics

The following normalizations are described in RFC 3986 [1] to result in equivalent URIs:

  • Converting percent-encoded triplets to uppercase. The hexadecimal digits within a percent-encoding triplet of the URI (e.g., %3a versus %3A) are case-insensitive and therefore should be normalized to use uppercase letters for the digits A-F.[2] Example:
http://example.com/foo%2a → http://example.com/foo%2A
  • Converting the scheme and host to lowercase. The scheme and host components of the URI are case-insensitive and therefore should be normalized to lowercase.[3] Example:
HTTP://User@Example.COM/Foo → http://User@example.com/Foo
  • Decoding percent-encoded triplets of unreserved characters. Percent-encoded triplets of the URI in the ranges of ALPHA (%41–%5A and %61–%7A), DIGIT (%30–%39), hyphen (%2D), period (%2E), underscore (%5F), or tilde (%7E) do not require percent-encoding and should be decoded to their corresponding unreserved characters.[4] Example:
http://example.com/%7Efoo → http://example.com/~foo
  • Removing dot-segments. Dot-segments . and .. in the path component of the URI should be removed by applying to the path the remove_dot_segments algorithm[5] described in RFC 3986.[6] Example:
http://example.com/foo/./bar/baz/../qux → http://example.com/foo/bar/qux
  • Converting an empty path to a "/" path. In presence of an authority component, an empty path component should be normalized to a path component of "/".[7] Example:
http://example.com → http://example.com/
  • Removing the default port. An empty or default port component of the URI (port 80 for the http scheme) with its ":" delimiter should be removed.[8] Example:
http://example.com:80/ → http://example.com/
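
The six normalizations above can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: it assumes an absolute URI with an authority-based scheme, the default-port table is an assumption limited to http and https, and the helper names (fix_percent_encoding, normalize_authority, remove_dot_segments, normalize_uri) are chosen here for illustration. The splitting expression is the one given in RFC 3986, Appendix B.

```python
import re

# Regular expression from RFC 3986, Appendix B, for splitting a URI reference.
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")
PCT_RE = re.compile(r"%([0-9A-Fa-f]{2})")
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)
# Assumed default-port table; extend it for other schemes as needed.
DEFAULT_PORTS = {"http": "80", "https": "443"}


def fix_percent_encoding(text):
    """Decode triplets of unreserved characters; uppercase all other triplets."""
    def repl(match):
        char = chr(int(match.group(1), 16))
        return char if char in UNRESERVED else "%" + match.group(1).upper()
    return PCT_RE.sub(repl, text)


def normalize_authority(authority, scheme):
    """Lowercase the host and drop an empty or default port; keep userinfo."""
    userinfo, at, hostport = authority.rpartition("@")
    if hostport.endswith("]") or ":" not in hostport:
        host, port = hostport, ""  # no port (possibly a bare IPv6 literal)
    else:
        host, _, port = hostport.rpartition(":")
    host = host.lower()
    if port and port != DEFAULT_PORTS.get(scheme):
        host += ":" + port
    return userinfo + at + host


def remove_dot_segments(path):
    """Follow the steps of RFC 3986, Section 5.2.4, on the path string."""
    output = []
    while path:
        if path.startswith("../"):
            path = path[3:]
        elif path.startswith("./"):
            path = path[2:]
        elif path.startswith("/./"):
            path = path[2:]
        elif path == "/.":
            path = "/"
        elif path.startswith("/../"):
            path = path[3:]
            if output:
                output.pop()
        elif path == "/..":
            path = "/"
            if output:
                output.pop()
        elif path in (".", ".."):
            path = ""
        else:
            # Move the first segment (with its leading "/") to the output.
            cut = path.find("/", 1)
            if cut == -1:
                output.append(path)
                path = ""
            else:
                output.append(path[:cut])
                path = path[cut:]
    return "".join(output)


def normalize_uri(uri):
    """Apply only the semantics-preserving normalizations listed above."""
    scheme, authority, path, query, fragment = URI_RE.match(uri).group(2, 4, 5, 7, 9)
    result = ""
    if scheme is not None:
        scheme = scheme.lower()
        result += scheme + ":"
    if authority is not None:
        result += "//" + normalize_authority(authority, scheme)
        path = path or "/"  # empty path becomes "/" when an authority is present
    result += remove_dot_segments(fix_percent_encoding(path))
    if query is not None:
        result += "?" + fix_percent_encoding(query)
    if fragment is not None:
        result += "#" + fix_percent_encoding(fragment)
    return result
```

Two URIs can then be compared by normalizing both and testing the results for string equality. The sketch deliberately keeps an empty query or fragment delimiter, since removing it belongs to the normalizations that change semantics.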

Normalizations that usually preserve semantics

For http and https URIs, the following normalizations listed in RFC 3986 may result in equivalent URIs, but the standards do not guarantee that they do:

  • Adding a trailing "/" to a non-empty path. Directories (folders) are indicated with a trailing slash, which should be included in URIs. Example:
http://example.com/foo → http://example.com/foo/
However, there is no way to know if a URI path component represents a directory or not. RFC 3986 notes that if the former URI redirects to the latter URI, then that is an indication that they are equivalent.

Normalizations that change semantics

Applying the following normalizations results in a semantically different URI, although it may refer to the same resource:

  • Removing directory index. Default directory indexes are generally not needed in URIs. Examples:
http://example.com/a/index.html → http://example.com/a/
http://example.com/default.asp → http://example.com/
  • Removing the fragment. The fragment component of a URI is never seen by the server and can sometimes be removed. Example:
http://example.com/bar.html#section1 → http://example.com/bar.html
However, AJAX applications frequently use the value in the fragment.
  • Replacing IP with domain name. Check if the IP address maps to a domain name. Example:
http://208.77.188.166/ → http://example.com/
The reverse replacement is rarely safe due to virtual web servers.
  • Limiting protocols. Different application layer protocols can be limited to a single one. For example, the “https” scheme could be replaced with “http”. Example:
https://example.com/ → http://example.com/
  • Removing duplicate slashes. Paths that include two adjacent slashes could be converted to use one. Example:
http://example.com/foo//bar.html → http://example.com/foo/bar.html
  • Removing or adding “www” as the first domain label. Some websites operate identically in two Internet domains: one whose least significant label is “www” and another whose name is the result of omitting the least significant label from the name of the first, the latter being known as a naked domain. For example, http://www.example.com/ and http://example.com/ may access the same website. Many websites redirect the user from the www to the non-www address or vice versa. A normalizer may determine if one of these URIs redirects to the other and normalize all URIs appropriately. Example:
http://www.example.com/ → http://example.com/
  • Sorting the query parameters. Some web pages use more than one query parameter in the URI. A normalizer can sort the parameters into alphabetical order (with their values), and reassemble the URI. Example:
http://example.com/display?lang=en&article=fred → http://example.com/display?article=fred&lang=en
However, the order of parameters in a URI may be significant (this is not defined by the standard) and a web server may allow the same variable to appear multiple times.[9]
  • Removing unused query variables. A page may only expect certain parameters to appear in the query; unused parameters can be removed. Example:
http://example.com/display?id=123&fakefoo=fakebar → http://example.com/display?id=123
Note that a parameter without a value is not necessarily an unused parameter.
  • Removing default query parameters. A default value in the query string may render identically whether it is there or not. Example:
http://example.com/display?id=&sort=ascending → http://example.com/display
  • Removing the "?" when the query is empty. When the query is empty, there may be no need for the "?". Example:
http://example.com/display? → http://example.com/display
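
Some of the semantics-changing normalizations above are purely syntactic and can be sketched with Python's standard urllib.parse module. The function name aggressive_normalize and its drop_params argument are hypothetical, and the sketch covers only four of the items: collapsing duplicate slashes, sorting query parameters, removing caller-specified query variables, and removing the fragment together with an empty "?". The sort is by parameter name only and is stable, so repeated variables keep their relative order. Note that parse_qsl and urlencode decode and re-encode the query, which may itself change its percent-encoding.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def aggressive_normalize(uri, drop_params=frozenset()):
    """Apply a few normalizations that may change the meaning of a URI."""
    parts = urlsplit(uri)
    # Collapse runs of adjacent slashes in the path into a single slash.
    path = re.sub(r"/{2,}", "/", parts.path)
    # Keep parameters with blank values: a parameter without a value
    # is not necessarily an unused parameter.
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    pairs = [(name, value) for name, value in pairs if name not in drop_params]
    # Stable sort by name, so repeated variables keep their relative order.
    pairs.sort(key=lambda pair: pair[0])
    query = urlencode(pairs)
    # An empty fragment drops "#..."; an empty query also drops the "?".
    return urlunsplit((parts.scheme, parts.netloc, path, query, ""))
```

Whether any of these steps is safe depends on the website in question, so a normalizer would normally enable them per site rather than globally.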

Normalization based on URI lists

Some normalization rules may be developed for specific websites by examining URI lists obtained from previous crawls or web server logs. For example, if the URI

http://example.com/story?id=xyz

appears in a crawl log several times along with

http://example.com/story_xyz

we may assume that the two URIs are equivalent and can be normalized to one of the URI forms.
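
Once such an equivalence has been observed, it could be recorded as a site-specific rewrite rule. The following Python sketch is only an illustration of that idea: the rule table RULES, the function apply_rules, and the choice of the story_xyz form as the normalized form are all assumptions, and the rule is written by hand rather than discovered automatically from a URI list.

```python
import re

# Hypothetical hand-written rule for the example site: rewrite
# "story?id=<value>" to "story_<value>". Each entry is (pattern, replacement).
RULES = [
    (re.compile(r"^(http://example\.com/story)\?id=([^&#]+)$"), r"\g<1>_\g<2>"),
]


def apply_rules(uri, rules=RULES):
    """Return the URI rewritten by the first matching rule, else unchanged."""
    for pattern, replacement in rules:
        new_uri, count = pattern.subn(replacement, uri)
        if count:
            return new_uri
    return uri
```

With this table, apply_rules maps http://example.com/story?id=xyz to http://example.com/story_xyz and leaves URIs that match no rule untouched.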

Schonfeld et al. (2006) present a heuristic called DustBuster for detecting DUST (different URIs with similar text) rules that can be applied to URI lists. They showed that once the correct DUST rules were found and applied with a normalization algorithm, they were able to find up to 68% of the redundant URIs in a URI list.

See also

  • URL (Uniform Resource Locator)
  • URI fragment
  • Web crawler

References

  1. ^ RFC 3986, Section 6. Normalization and Comparison
  2. ^ RFC 3986, Section 6.2.2.1. Case Normalization
  3. ^ RFC 3986, Section 6.2.2.1. Case Normalization
  4. ^ RFC 3986, Section 6.2.2.2. Percent-Encoding Normalization
  5. ^ RFC 3986, Section 5.2.4. Remove Dot Segments
  6. ^ RFC 3986, Section 6.2.2.3. Path Segment Normalization
  7. ^ RFC 3986, Section 6.2.3. Scheme-Based Normalization
  8. ^ RFC 3986, Section 6.2.3. Scheme-Based Normalization
  9. ^ "jQuery 1.4 $.param demystified". Ben Alman. December 20, 2009. Retrieved August 24, 2013.