Protocolv2Spec
Client specification for the Google Safe Browsing v2.2 protocol
Status: CURRENT as of 2009/3/10. This specification is not yet for general use. Do not use this protocol without explicit written permission from Google. Copyright 2007 Google Inc. All Rights Reserved. Authors: Garrett Casto, Oliver Fisher, Raphaƫl Moll, Marria Nazif, and Dan Born Notes: The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119. 1. BackgroundGoogle provides data for the anti-phishing feature implemented in Firefox 2 and Google Desktop. These clients get their blacklist and whitelist data using an "update protocol". A new protocol, version 2.2, is designed to address some shortcomings of the previous protocol and is described in this specification. Note: This document assumes the reader is familiar with the anti-phishing service. For an overview of the phishing protection in Firefox, see the design doc on Mozilla.org. However readers need not be familiar with the details of the version 1 of the protocol to understand the second version. Version 1 of the update protocol is inefficient and not scalable. Caveats of the version 1 of the protocol include:
Note: This is not a license to use the defined protocol. This is merely a description of the protocol. 2. OverviewVersion 2 of the update protocol is designed with the following characteristics:
As with the previous protocol, the new protocol supports many different blacklists or whitelists. List names are in the form "provider-type-format", e.g. "goog-phish-shavar". Each item in a list will represent an expression that will match a malicious url, but the exact format depends on the list type and how the content is used is application-specific. Note that the rest of the specification will generally talk about lists in terms of blacklists but the protocol itself is agnostic to the content of the list. (See the List Contents section below for details.) The lists are divided into chunks, the smallest unit of data that will be sent to the client. This allows for supporting partial updates to all users, including new users, and allows for more flexibility in choosing which data to send the client. The actual chunk size is determined by the server. There are two kind of chunks:
Chunks are assigned a number, which is a sequence number for chunks of the same type. For example for a given list, there will be:
For a blacklist, "add" chunks contain the new URLs regular expressions or hashes to add to the blacklist and "sub" chunks contains the false positives that need to be removed from the client's blacklist. In contrast with the previous protocol, the server no longer lists all the URLs that need to be expired. To save bandwidth, the server indicates which chunks need to be deleted by specifying a previously-seen "add" chunk number. Note: This spec now references pver 2.2 instead of 2.1. The change to the protocol was in the handling of empty chunks. They were illegal in pver 2.1 and used in pver 2.2 to reduce client request size. This change only affects section 3.5.2. 3. Protocol SpecificationThe client-server exchange uses a simple pull model: the client connects regularly to the server and pulls some updates. The data exchange can be summarized as follows:
Besides the data exchange, the server provides a way for the client to discover which lists are available. 3.1. R-BNFThis document uses a R-BNF notation, which is a mix between Extended BNF and PCRE-style regular expressions:
The following basic rules that describe the US-ASCII character set are also used as defined in RFC 2616:
3.2. HTTP Request for ListThis is used by clients to discover the available list types. 3.2.1. Request's URLThe client performs a request by sending an HTTP POST request to the URI: http://safebrowsing.clients.google.com/safebrowsing/list?client=CLIENTID&appver=CLIENTVER&pver=PVER&wrkey=MACKEY Required CGI parameters:
Optional CGI parameters:
Formal R-BNF description: CLIENTID = (LOALPHA | "-")+ CLIENTVER = DIGIT ["." DIGIT] PVER = DIGIT "." DIGIT MACKEY = (ALPHA | DIGIT)+ Example: http://safebrowsing.clients.google.com/safebrowsing/list?client=myapplication&appver=1.5.2&pver=2.2 Client Behavior:
3.2.2. Request's BodyThere is no body content for this request -- any body data will be ignored by the server. 3.3. HTTP Response for ListThe server replies using the error code and response body of the HTTP response. No specific HTTP headers is set by the server -- some HTTP headers MAY be present but are not authoritative. 3.3.1. Response CodeThe server generates the following HTTP error codes:
3.3.2. Response BodyThere is no data in the response body for codes in 3xx, 4xx and 5xx. The response body may be empty. When present, the response body contains the name of each list that this client can access. Formal R-BNF description of the response body: BODY = ([MAC LF] (LISTNAME LF)*) | (REKEY LF) EOF LISTNAME = (LOALPHA | DIGIT)+ "-" LOALPHA+ "-" (LOALPHA | DIGIT)+ REKEY = "e:pleaserekey" Example: goog-phish-shavar goog-malware-shavar 3.4. HTTP Request for DataThis is used by clients who want to get new data for known list types. 3.4.1. Request's URLThe client performs a datarequest by sending an HTTP POST request to the URI: http://safebrowsing.clients.google.com/safebrowsing/downloads?client=CLIENTID&appver=CLIENTVER&pver=PVER&wrkey=MACKEY CGI parameters are the same as those used in the HTTP Request for List (section 3.2 above.) Formal R-BNF description: CLIENTID = (LOALPHA | "-")+ CLIENTVER = DIGIT ["." DIGIT] PVER = DIGIT "." DIGIT MACKEY = (ALPHA | DIGIT)+ Example: http://safebrowsing.clients.google.com/safebrowsing/downloads?client=myapplication&appver=1.5.2&pver=2.2 Client Behavior:
3.4.2. Request's bodyThe request body is used to specify what the client has and wants:
The format of the body is line oriented. Lines are separated by LF. Lines which cannot be understood are ignored by the server. Formal R-BNF description of the request body: BODY = [SIZE LF] (LIST LF)+ EOF SIZE = "s;" DIGIT+ # Optional size, in kilobytes and >= 1 LIST = LISTNAME ";" (["mac"] | (LISTINFO (":" LISTINFO)* [":mac"])) LISTINFO = CHUNKTYPE ":" CHUNKLIST LISTNAME = (LOALPHA | DIGIT)+ "-" LOALPHA+ "-" (LOALPHA | DIGIT)+ CHUNKTYPE = "a" | "s" # 'Add' or 'Sub' chunks CHUNKLIST = (RANGE | NUMBER) ["," CHUNKLIST] NUMBER = DIGIT+ # Chunk number >= 1 RANGE = NUMBER "-" NUMBER Note that the last line of the body MUST have a trailing line-feed. The size request is optional. If present, the number indicates the ideal maximum response size, in kilobytes, that the server should reply. The size is used a hint by the server; the actual reply size may vary and could be larger or smaller than the ideal size specified by the client. We strongly recommend that clients omit the size field unless they have a special need to limit the response size. Clients who are operating on a small bandwidth, such as a modem, may want to use the size field to limit the response size. However doing so may cause the client to permanently lag behind. If unsure, clients should omit the size field and let the server decide of the appropriate response size. Example 1: goog-phish-shavar;a:1-3,5,8:s:4-5 acme-white-shavar;a:1-7:s:1-2 In this example, the client requests data for two lists. It then lists the chunks it already has for each list type. Example 2: s;200 goog-phish-shavar;a:1-3,5,8:s:4-5 acme-white-shavar;a:1-7:s:1-2 In this example, the client requests a response size of 200 kilobytes for the two given lists. It then lists the chunks it already has for each list type. Note that at first the client has no data so it has no chunk number on its side. Generally speaking if a client does not have any chunks of one type it should not list the corresponding chunk type. Example (inline comments start after a # and are not part of the protocol:) goog-phish-shavar;a:1-5 # The client has 'add' chunks but no 'sub' chunks acme-malware-shavar; # The client has no data for this list. acme-white-shavar;mac # No data here either and it wants a mac Examples of good chunk lists: goog-phish-shavar;a:1-5,10,12:s:3-8 goog-phish-shavar;a:1,2,3,4,5,10,12,15,16 goog-phish-shavar;a:1-5,10,12,15-16 goog-phish-shavar;a:16-10,2-5,4 Examples of bad chunk lists: goog-phish-shavar # Missing ; at end of list name goog-phish-shavar;5-1,16-10 # Missing 'a:' or 's:' for chunk type goog-phish-shavar;a:5-1:s: # Missing chunk numbers for 's:' Server Behavior:
Client Behavior:
3.5. HTTP Response for DataThe server replies using the error code and response body of the HTTP response. No specific HTTP headers is set by the server -- some HTTP headers MAY be present but are not authoritative. 3.5.1. Response CodeThe server generates the following HTTP error codes:
3.5.2. Response BodyThe response body will not be present for codes in 4xx and 5xx. When present, the response body contains the following information:
The body contains both the chunk data (binary blobs) and the metadata describing the chunks. The metadata is line oriented. Formal R-BNF description of the response body: BODY = [(REKEY | MAC) LF] NEXT LF (RESET | (LIST LF)+) EOF NEXT = "n:" DIGIT+ # Minimum delay before polling again in seconds REKEY = "e:pleaserekey" RESET = "r:pleasereset" LIST = "i:" LISTNAME [MAC] (LF LISTDATA)+ LISTNAME = (LOALPHA | DIGIT | "-")+ # e.g. "goog-phish-sha128" MAC = "," (LOALPHA | DIGIT)+ LISTDATA = ((REDIRECT_URL | ADDDEL-HEAD | SUBDEL-HEAD) LF)+ REDIRECT_URL = "u:" URL [MAC] URL = Defined in RFC 1738 ADDDEL-HEAD = "ad:" CHUNKLIST SUBDEL-HEAD = "sd:" CHUNKLIST CHUNKLIST = (RANGE | NUMBER) ["," CHUNKLIST] NUMBER = DIGIT+ # Chunk number >= 1 RANGE = NUMBER "-" NUMBER Generally speaking lines are in the form "keyword colon parameters". The keyword is limited to one character in this implementation. We reserve the right to use longer keywords later. A reset response from the server means to clear out all current data in the database before requesting again. The response doesn't actually contain the data associated with the lists, instead it tells you were the find the data via redirect urls. These urls should be visited in the order that they are given, and if an error is encountered fetching any of the urls then the client must NOT fetch any url after that. Parallel fetching is NOT allowed. formal R-BNF description of redirect response: BODY = (ADD-HEAD | SUB-HEAD)+ ADD-HEAD = "a:" CHUNKNUM ":" HASHLEN ":" CHUNKLEN LF CHUNKDATA # Length in bytes in decimal SUB-HEAD = "s:" CHUNKNUM ":" HASHLEN ":" CHUNKLEN LF CHUNKDATA # Length in bytes in decimal CHUNKNUM = DIGIT+ # Sequence number of the chunk HASHLEN = DIGIT+ # Decimal length of each hash prefix in bytes CHUNKLEN = DIGIT+ # Size of the chunk data in bytes >= 0 CHUNKDATA = <CHUNKLEN number of unsigned bytes> The MACs used in this response are described in detail in section 4. The format for add and sub chunks is exactly the same. The number of bytes is followed by one line-feed character (LF, \n) then a binary blob of the indicated length. The length is expressed in bytes in decimal. The LF that follows CHUNKLEN is a separator and is not counted in the length itself. Chunk data encoding is explained in the next section. Chunk data is both encoded and compressed according to which list type it is (see List Contents section below). The length described by CHUNKLEN is the size after encoding and/or compression. The adddel and subdel chunks are used to expire previous add and sub chunks. Consequently they have no associated chunk data. More than one chunk can be specified, either by listing each number or using a range or a combination of both. When an add chunk is deleted, the client can delete the data associated with that chunk. When a sub chunk is deleted, the client simply no longer reports that it received that sub chunk in the past. If there are no chunks of a given type, the entire LISTDATA will be omitted. That is for example if there are no sub or subdel chunks for a given list, there will be no corresponding "s:" or "sd:" information in the metadata. Chunk types (add, sub, adddel and subdel) can be presented in any order. They can even be intermixed. The order of the chunks depends on the implementation of the server and the clients MUST NOT rely on any empirical behavior. Moreover, the sequence order in which chunks of the same type are present in the stream is not guaranteed. Chunks may be empty. That is, it is prefectly valid for CHUNKLEN to be 0. In this case the prefix size will still be set, but will have no meaning. Chunks may be given this way to prevent fragmentation of requests and reduce request size. In the case of an empty add chunk, it's possible that the client has or will receive a sub chunk that contains an expression that points to the empty add. In this case, the client is allowed to drop the sub expression. Example: n:1200 i:goog-phish-shavar u:cache.google.com/first_redirect_example sd:1,2 i:acme-white-shavar u:cache.google.com/second_redirect_example ad:1-2,4-5,7 sd:2-6 contents of first_redirect_example: a:4:48:1200 [encoded data] s:3:96:100 [encoded data] a:6:32:800 [encoded data] a:7:32:0 s:4:16:40 [encoded data] contents of second_redirect_example: a:9:32:320 [encoded data] a:10:64:320 [encoded data] In this example, there are no adddel chunks for the "goog-phish-shavar" list, and there are no sub chunks for the "acme-white-shavar" redirect response. Server Behavior:
Client Behavior:
3.6. List ContentsThe contents of each chunk depends on the list type that the chunk belongs to. Currently, the possible lists are:
The "shavar" list type relies on suffix/prefix expressions. Each of the suffix/prefix expressions consists of a host suffix (or full host) and a path prefix (or full path). Note that the path prefix consists of full path components. If the expression contains the full path, there may optionally be query parameters appended to the path. Examples: Regular expression: http\:\/\/.*\.a\.b\/mypath\/.* Suffix/prefix expression: a.b/mypath/ Regular expression: http\:\/\/.*.c\.d\/full\/path\.html?myparam=a Suffix/prefix regular expression: c.d/full/path.html?myparam=a 3.6.1. shavar list formatFor the "shavar" list format, hash prefixes are used to reduce bandwidth. A hash prefix is some number of the most significant bytes of a full-length, 256-bit hash. The chunk data header indicates the length of the hash prefixes in that chunk. For add chunks, the encoded chunk data is a list of add host key entries. Each add host key consists of a 32-bit hash prefix for a host key (see below) followed by a one (1) byte hash prefix count that follows. The hash prefixes are of the length specified in the chunk data header. Note that the host key hash prefix is always 32-bits. A host key may be repeated (e.g., there are more than 255 entries for a given host key). There may be no hash prefixes following a host key, and in such cases the one byte count will be 0. This has special meaning and indicates that all urls under a host should be considered a match. Sub chunks are similar, except that in sub host key entries, each hash prefix is preceded by the add chunk number that contained the add expression. ADD-DATA = (HOSTKEY COUNT [PREFIX]*)+ HOSTKEY = <4 unsigned bytes> # 32-bit hash prefix COUNT = <1 unsigned byte> PREFIX = <HASHLEN unsigned bytes> SUB-DATA = (HOSTKEY COUNT [ADDCHUNKNUM] [ADDCHUNKNUM PREFIX]*)+ HOSTKEY = <4 unsigned bytes> # 32-bit hash prefix COUNT = <1 unsigned byte> ADDCHUNKNUM = <4 byte unsigned integer in network byte order> PREFIX = <HASHLEN unsigned bytes> That is, for SUB-DATA, in the case of a COUNT value of 0, only an ADDCHUNKNUM will be present following the COUNT, to indicate the add chunk that contained the host key. In the case of COUNT >= 1, there will be 1 or more [ADDCHUNKNUM PREFIX] pairs. To form a host key, first canonicalize the url then take the three most significant host components if there are three or more components in the blacklist entry, or two host components if a third does not exist. Append a trailing slash (/) to this string, then use the 32 most significant bits of the string's SHAVAR hash as the host key. To be clear, to match a URL against a host key, a client must try matching based upon the two most significant host components, and also the three most significant host components if three such components exist. An exception to the above exists when the host is an IP address. Here, we form the host key by taking the SHAVAR hash of the entire host (IP address), NOT two or three octets of the IP address. Examples (blacklist entry -> host key, before hashing): google.com/ -> google.com/ sb.google.com/abc/ -> sb.google.com/ a.b.c.google.com/123/ -> c.google.com/ Here are some example of "shavar" hashes, based on the examples from FIPS-180-2:
Here's a unit test you can use to validate the key computation (in pseudo-C): // Example B1 from FIPS-180-2 string input1 = "abc"; string output1 = TruncatedSha256Prefix(input1, 32); int expected1[] = { 0xba, 0x78, 0x16, 0xbf }; assert(output1.size() == 4); // 4 bytes == 32 bits for (int i = 0; i < output1.size(); i++) assert(output1[i] == expected1[i]); // Example B2 from FIPS-180-2 string input2 = "abcdbcdecdefdefgefghfghighijhijkijkljklmklmnlmnomnopnopq"; string output2 = TruncatedSha256Prefix(input2, 48); int expected2[] = { 0x24, 0x8d, 0x6a, 0x61, 0xd2, 0x06 }; assert(output2.size() == 6); for (int i = 0; i < output2.size(); i++) assert(output2[i] == expected2[i]); // Example B3 from FIPS-180-2 string input3(1000000, 'a'); // 'a' repeated a million times string output3 = TruncatedSha256Prefix(input3, 96); int expected3[] = { 0xcd, 0xc7, 0x6e, 0x5c, 0x99, 0x14, 0xfb, 0x92, 0x81, 0xa1, 0xc7, 0xe2 }; assert(output3.size() == 12); for (int i = 0; i < output3.size(); i++) assert(output3[i] == expected3[i]); 3.7. HTTP Request for Full-Length HashesA client may request the list of full-length hashes for a hash prefix. This usually occurs when a client is about to download content from a url whose calculated hash starts with a prefix listed in a blacklist. See the Lookup section below for details. The client MUST not make this request if it received the full length hashes in a previous request, for each list and chunk that the prefix occurs in. 3.7.1. Request's URLThe client performs a datarequest by sending an HTTP POST request to the URI: http://safebrowsing.clients.google.com/safebrowsing/gethash?client=CLIENTID&appver=CLIENTVER&pver=PVER&wrkey=MACKEY Most CGI parameters are the same as those used in the HTTP Request for List (section 3.2 above.) Formal R-BNF description: CLIENTID = (LOALPHA | "-")+ CLIENTVER = DIGIT ["." DIGIT] PVER = DIGIT "." DIGIT MACKEY = (ALPHA | DIGIT)+ Example: http://safebrowsing.clients.google.com/safebrowsing/gethash?client=myapplication&appver=1.5.2&pver=2.2 Client Behavior:
3.7.2. Request's bodyThe request body specifies the list of hash prefixes for which the client should receive full length hashes. Formal R-BNF description of the request body: BODY = HEADER LF PREFIXES EOF HEADER = PREFIXSIZE ":" LENGTH PREFIXSIZE = DIGIT+ # Size of each prefix in bytes LENGTH = DIGIT+ # Size of PREFIXES in bytes PREFIXES is a list of PREFIXSIZE values. Note that the server returns the full hash for any matching prefixes given in the request. There may be 0 or more matches for each prefix given. 3.8. HTTP Response for Full-Length HashesThe server replies using the error code and response body of the HTTP response. No specific HTTP headers is set by the server -- some HTTP headers MAY be present but are not authoritative. 3.8.1. Response CodeThe server generates the following HTTP error codes:
If there are no hashes starting with the requested prefix the server MUST return HTTP error code 204 and the body of the response MUST contain no data. This is an expected situation and may occur if a client has not yet downloaded an update to a list that deletes the requested prefix. 3.8.2. Response BodyThe response body will not be present for codes in 4xx and 5xx nor for response code 204. When present, the response body contains the following information:
Formal R-BNF description of the response body: BODY = ([MAC LF] HASHENTRY+) | (REKEY LF) EOF HASHENTRY = LISTNAME ":" ADDCHUNK ":" HASHDATALEN LF HASHDATA ADDCHUNK = DIGIT+ # Add chunk number HASHDATALEN = DIGIT+ # Length of HASHDATA HASHDATA = <HASHDATALEN number of unsigned bytes> # Full length hashes in binary MAC = (LOALPHA | DIGIT)+ "MAC LF" is only present if the client requested MACing by sending its wrapped key. HASHDATA is grouped with LISTNAME and ADDCHUNKNUM to correlate it with the previously received (prefix, LISTNAME, ADDCHUNKNUM). Each hash in the response MUST be stored as the full length hash for the prefix in the list and chunk indicated in the response, for the length of time that the hash prefix is a valid entry in the chunk. The full length hash should always be used when performing lookups instead of the prefix, when it is available. Thus, the client MUST not make any further full length hash requests for that hash, unless a client is following the timing requirements set forth in Section 5.1 and would be unable to issue a warning due to timing constraints in Section 6.3 (e.g. the client has just been launched within the past 5 minutes and so has not yet done a list update, or the list is more than 45 minutes out of date because the client has been backed off and is following the update frequency requested by the server.) 4. MACWe support a Message Authentication Code in this protocol, similar to the old protocol. 4.1. HTTP Request for KeyIn order to receive a MAC, the client must request a key from the server over a secure connection. The newkey request should only be called once per client, unless the server requests that the client changes its key (see below). 4.1.1. Request's URLThe client performs a request by sending an HTTP GET request to the URI: https://sb-ssl.google.com/safebrowsing/newkey?client=CLIENTID&appver=CLIENTVER&pver=PVER CGI parameters are the same as those used in the HTTP Request for List (section 3.2 above.) Example: https://sb-ssl.google.com/safebrowsing/newkey?client=myapplication&appver=1.5.2&pver=2.2 4.2. HTTP Response for Key4.2.1. Response CodeThe server generates the following HTTP error codes:
4.2.2. Response BodyThere is no data in the response body for codes in 3xx, 4xx and 5xx. When present, the response body contains a client key and a wrapped key. Formal R-BNF description of the response body: BODY = "clientkey:" LENGTH ":" MACKEY LF "wrappedkey:" LENGTH ":" MACKEY EOF LENGTH = DIGIT+ Example: clientkey:24:pOAblTUiZFkLSv3xRiXKKQ== wrappedkey:24:MTqdJvrixHRGAyfebvaQWYda 4.3. Requesting the MACThe client must include the wrapped key in the request if it wants a MAC. For example, the request will look like this: http://safebrowsing.clients.google.com/safebrowsing/downloads?client=foo&appver=1.5&pver=2.2&wrkey=123 The entire response will be protected by a MAC if the wrkey parameter is set. The client can also request a MAC for any individual list by adding the mac keyword after any desired list: s;200 goog-phish-shavar;a:1-3,5,8:s:4-5 acme-white-shavar;a:1-7:s:1-2:mac In the response, the full body MAC will be listed first, and any list MACs will be listed after the redirect url that they belong to: m:MACOFDATA n:1200 i:goog-phish-shavar u:cache.google.com/redirect_one i:acme-white-shavar u:cache.google.com,FIRST_MAC d:1-2 Boths MACs are HMAC-SHA1 and are websafe base64 encoded. The first MAC is all data in the response after the MAC and trailing LF. The second covers all the data in the redirect response. 4.4. Key ExpirationAt any time, when a client includes the wrkey CGI parameter in a request, the server may prepend "e:pleaserekey" to any response, on a separate line. This indicates that the client key is no longer valid and that the client should request a new key using the newkey request specified above. 5. Request FrequencyProviding the data on the server for updates and lookups requires a fair amount of resources. To help maintain a high quality of service, it may be necessary for the download servers to ask the client to make more or less frequent requests. This is handled differently depending on the type of request. 5.1 HTTP Request for DataWhen requesting a download of data from the server, there are two mechanisms are available to control request frequency:
Client Behavior:
Client Behavior on error or timeout:
5.2 HTTP Request for Full-Length HashesWhen requesting a full-length hash from the server, if the client successfully receives a response, it MUST be stored as the full length hash for the prefix in the list and chunk indicated in the response, for the length of time that the hash prefix is a valid entry in the chunk. The full length hash should always be used when performing lookups instead of the prefix, when it is available. Thus, the client MUST not make any further full length hash requests for that hash. Client behavior on error or timeout:
6. Performing Lookups6.1. CanonicalizationBefore lookup in any list, the url must be canonicalized. We assume that the client has parsed the URL and made it valid according to RFC 2396. If it's an international url, use the ascii punycode representation. The URL must include a path component, e.g. 'http://google.com/' must have a trailing slash. To start, remove any tab (0x09), CR (0x0d), and LF (0x0a) characters from the URL. Note that escape sequences for these characters, e.g. '%0a', should not be removed. If the URL ends in a fragment, the fragment should be removed. For example, 'http://google.com/#frag' would be shortened to 'http://google.com/'. Next, repeatedly URL-unescape the URL until it has no more hex-encodings. To canonicalize the hostname: Extract the hostname from the URL and then follow these steps:
To canonicalize the path:
These path canonicalizations should not be applied to the query parameters. After performing these steps, percent-escape all characters in the URL which are <= ASCII 32, >= 127, "#", or "%". The escapes should use uppercase hex characters. Below is a set of tests that will help validate a canonicalization implementation. Canonicalize("http://host/%25%32%35") = "http://host/%25"; Canonicalize("http://host/%25%32%35%25%32%35") = "http://host/%25%25"; Canonicalize("http://host/%2525252525252525") = "http://host/%25"; Canonicalize("http://host/asdf%25%32%35asd") = "http://host/asdf%25asd"; Canonicalize("http://host/%%%25%32%35asd%%") = "http://host/%25%25%25asd%25%25"; Canonicalize("http://www.google.com/") = "http://www.google.com/"; Canonicalize("http://%31%36%38%2e%31%38%38%2e%39%39%2e%32%36/%2E%73%65%63%75%72%65/%77%77%77%2E%65%62%61%79%2E%63%6F%6D/") = "http://168.188.99.26/.secure/www.ebay.com/"; Canonicalize("http://195.127.0.11/uploads/%20%20%20%20/.verify/.eBaysecure=updateuserdataxplimnbqmn-xplmvalidateinfoswqpcmlx=hgplmcx/") = "http://195.127.0.11/uploads/%20%20%20%20/.verify/.eBaysecure=updateuserdataxplimnbqmn-xplmvalidateinfoswqpcmlx=hgplmcx/"; Canonicalize("http://host%23.com/%257Ea%2521b%2540c%2523d%2524e%25f%255E00%252611%252A22%252833%252944_55%252B") = "http://host%23.com/~a!b@c%23d$e%25f^00&11*22(33)44_55+"; Canonicalize("http://3279880203/blah") = "http://195.127.0.11/blah"; Canonicalize("http://www.google.com/blah/..") = "http://www.google.com/"; Canonicalize("www.google.com/") = "http://www.google.com/"; Canonicalize("www.google.com") = "http://www.google.com/"; Canonicalize("http://www.evil.com/blah#frag") = "http://www.evil.com/blah"; Canonicalize("http://www.GOOgle.com/") = "http://www.google.com/"; Canonicalize("http://www.google.com.../") = "http://www.google.com/"; Canonicalize("http://www.google.com/foo\tbar\rbaz\n2") ="http://www.google.com/foobarbaz2"; Canonicalize("http://www.google.com/q?") = "http://www.google.com/q?"; Canonicalize("http://www.google.com/q?r?") = "http://www.google.com/q?r?"; Canonicalize("http://www.google.com/q?r?s") = "http://www.google.com/q?r?s"; Canonicalize("http://evil.com/foo#bar#baz") = "http://evil.com/foo"; Canonicalize("http://evil.com/foo;") = "http://evil.com/foo;"; Canonicalize("http://evil.com/foo?bar;") = "http://evil.com/foo?bar;"; Canonicalize("http://\x01\x80.com/") = "http://%01%80.com/"; Canonicalize("http://notrailingslash.com") = "http://notrailingslash.com/"; Canonicalize("http://www.gotaport.com:1234/") = "http://www.gotaport.com:1234/"; Canonicalize(" http://www.google.com/ ") = "http://www.google.com/"; Canonicalize("http:// leadingspace.com/") = "http://%20leadingspace.com/"; Canonicalize("http://%20leadingspace.com/") = "http://%20leadingspace.com/"; Canonicalize("%20leadingspace.com/") = "http://%20leadingspace.com/"; Canonicalize("https://www.securesite.com/") = "https://www.securesite.com/"; Canonicalize("http://host.com/ab%23cd") = "http://host.com/ab%23cd"; Canonicalize("http://host.com//twoslashes?more//slashes") = "http://host.com/twoslashes?more//slashes"; 6.2. Simplified Regular Expression LookupCurrently all valid list types rely on suffix/prefix expressions, as described in the List Contents section above. To perform a lookup for a given url, the client will try to form different possible host suffix and path prefix combinations and seeing if they match each list. Depending on the list type, the suffix/prefix combination may be hashed before lookup. For these lookups, only the host and path components of the URL are used. The scheme, username, password, and port are disregarded. If query parameters are present in the url, the client will also include a lookup with the full path and query parameters. For the hostname, the client will try at most 5 different strings. They are:
For the path, the client will also try at most 6 different strings. They are:
The following examples should help illustrate the lookup behavior: For the url http://a.b.c/1/2.html?param=1, the client will try these possible strings: a.b.c/1/2.html?param=1 a.b.c/1/2.html a.b.c/ a.b.c/1/ b.c/1/2.html?param=1 b.c/1/2.html b.c/ b.c/1/ For the url http://a.b.c.d.e.f.g/1.html, the client will try these possible strings: a.b.c.d.e.f.g/1.html a.b.c.d.e.f.g/ (Note: skip b.c.d.e.f.g, since we'll take only the last 5 hostname components, and the full hostname) c.d.e.f.g/1.html c.d.e.f.g/ d.e.f.g/1.html d.e.f.g/ e.f.g/1.html e.f.g/ f.g/1.html f.g/ For the url http://1.2.3.4/1/, the client will try these possible strings: 1.2.3.4/1/ 1.2.3.4/ 6.3. Age of Data, UsageApplications retrieving data using the API must be certain never to use data older than 45 minutes. What this means specifically is that a warning can be shown only in the following 3 scenarios:
Under no other circumstances may a warning be shown. 7. References
|
This protocol doesn't sound very caching proxy friendly. Are there any plans for those with restrictive internet connections with many instances of Firefox to share a cache?
Fowlthe2nd - that's true that the POST requests are not very cache friendly. However, the redirect urls are cacheable, and that's where the bulk of the data is sent to the client. In practice, the number of unique redirect urls is not too high, so we hope that helps with caching.
Suggestion:
In a worst case scenario for an URL the client needs to compute 25 sha256 hashes. If the length of the original string would be available, it could lookup the length and avoid calculating 25 sha256 hashes, and only calculate those hashes where the length matches.
I have just encountered a blocked site on eBay on a seller's site that I have used in the past without any problems. It appears to be a malicious report and there appears to be no way of reporting this. I can see that the whole protocol is open to abuse by users reporting sites in order to remove their competition on sites like eBay. phishing is one thing but libelling a seller in this way is going to make a fortune for lawyers.
The last comment bei James, is absolutly correct. We have a newsletter system for a high value customer. All reciepients theirselves subscribed to the newsletters via double opt-in. Currently the page is blocked by google which causes an enourmous financial damage any worst reputation.
Please provide a way to avoid abuse by unown users or rival companies.
edwintorok - thanks for the suggestion. Note that the hostkey can currently be checked to short stop the full 25 lookups in most cases. To look up the host key, you only need to hash the last 2 components and the last 3 components of the hostname.
James and Jakob - please use this page to report false positives: http://www.google.com/safebrowsing/report_error/?tpl=generic
Also see this page for some more info: http://www.google.com/support/webmasters/bin/answer.py?answer=45432
Thank you for looking out for me, i am going back to the email you sent to better understand what happened to me [email protected]
Por que me bloquean mis mensajes
At the moment, Wikipedia states that the use of the safebrowse protocol is incompatible with the Mozilla Manifesto b/c safebrowse is not open source (it says at the beginning "This specification is not yet for general use. Do not use this protocol without explicit written permission from Google."). On the other hand, the Google code page for safebrowse states that safebrowse is available under a new BSD license. Which is correct?
This is a great effort. You guys at Google are doing a brilliant job. Keep up the good work.
A few comments on Canonicalization:
RFC 2396 has been replaced by RFC 3986. It shouldn't be referenced anymore.
If I repeatedly unescape the URL, it won't pass the set of test (the last url of the test will create a fragment while it shouldn't). For that I need to repeatedly unescape individual components of the URL.
Would it be possible to get a reference implementation of the Canonicalization?
porra
PhilippeLYP: Thanks for the note about RFC 2396, I'll update the references to it when I have a chance.
The canonicalization for http://host.com/ab%23cd would work like this:
1. Strip the fragment. There isn't one, since the # is escaped, so this is a no-op. 2. Repeatedly unescape the URL. This gives http://host.com/ab#cd 3. Host and path canonicalizations won't have any effect on this url. 4. Escape the characters <= 32, >= 127, #, and %. This will give you the result, http://host.com/ab%23cd
Hope that helps.