Wayback Machine: Difference between revisions

On April 17, 2017, reports surfaced that defunct sites that had become [[parked domain]]s were using robots.txt to exclude themselves from search engines, which inadvertently excluded them from the Wayback Machine as well.<ref>{{cite web |title=Robots.txt meant for search engines don't work well for web archives |url=https://blog.archive.org/2017/04/17/robots-txt-meant-for-search-engines-dont-work-well-for-web-archives/ |website=Internet Archive |date=April 17, 2017 |access-date=June 29, 2019}}</ref> The Internet Archive then changed its policy to require an explicit exclusion request before a site is removed from the Wayback Machine.<ref name="using" />
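The effect of such a file can be illustrated with a minimal sketch that uses Python's standard <code>urllib.robotparser</code> module. The robots.txt contents, the example URL, and the helper name <code>is_allowed</code> are hypothetical, and the <code>ia_archiver</code> user agent is assumed here purely for illustration; this is not the Internet Archive's actual crawler logic.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt of the kind a domain-parking service might serve.
# "Disallow: /" under "User-agent: *" asks every compliant crawler to stay out.
PARKED_ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Assumed user agent and example URL, for illustration only.
if is_allowed(PARKED_ROBOTS_TXT, "ia_archiver", "http://example.com/old-page.html"):
    print("crawler may fetch the page")
else:
    print("crawler is excluded by robots.txt")
```

An archive that honors a blanket rule like this would treat the whole domain as excluded, including pages captured before the domain was parked, which is the inadvertent exclusion described above.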
 
====Oakland Archive Policy====
Wayback's retroactive exclusion policy is based in part upon ''Recommendations for Managing Removal Requests and Preserving Archival Integrity'' published by the School of Information Management and Systems at [[University of California, Berkeley]] in 2002, which gives a website owner the right to block access to the site's archives.<ref>{{cite web |title=Recommendations for Managing Removal Requests And Preserving Archival Integrity |date=December 14, 2002 |publisher=[[University of California]] |url=http://www2.sims.berkeley.edu/research/conferences/aps/removal-policy.html |access-date=September 14, 2017 |url-status=live |archive-url=https://web.archive.org/web/20170918025220/http://www2.sims.berkeley.edu/research/conferences/aps/removal-policy.html |archive-date=September 18, 2017 }}</ref> Wayback has complied with this policy to help avoid expensive litigation.<ref>{{cite web |title=Retroactive robots.txt removal of past crawls AKA Oakland Archive Policy |date=July 7, 2014 |publisher=Internet Archive |url=https://archive.org/post/1019415/retroactive-robotstxt-removal-of-past-crawls-aka-oakland-archive-policy |access-date=September 14, 2017 |url-status=live |archive-url=https://web.archive.org/web/20171010124036/https://archive.org/post/1019415/retroactive-robotstxt-removal-of-past-crawls-aka-oakland-archive-policy |archive-date=October 10, 2017 }}</ref>
 
The Wayback retroactive exclusion policy began to relax in 2017, when it stopped honoring robots.txt on U.S. government and military web sites for both crawling and displaying web pages. As of April 2017, Wayback is ignoring robots.txt more broadly, not just for U.S. government websites.<ref>{{cite web |url=http://blog.archive.org/2017/04/17/robots-txt-meant-for-search-engines-dont-work-well-for-web-archives/ |title=Robots.txt meant for search engines don't work well for web archives |work=Internet Archive Blogs |first=Mark |last=Graham |date=April 17, 2017 |access-date=April 16, 2017 |url-status=live |archive-url=https://web.archive.org/web/20170417131508/http://blog.archive.org/2017/04/17/robots-txt-meant-for-search-engines-dont-work-well-for-web-archives/ |archive-date=April 17, 2017}}</ref><ref>{{cite web |title=Archivierung des Internets: Internet Archive ignoriert künftig robots.txt |date=April 25, 2017 |url=https://www.heise.de/newsticker/meldung/Archivierung-des-Internets-Internet-Archive-ignoriert-kuenftig-robots-txt-3693558.html |publisher=heise online |access-date=May 14, 2017 |language=de |url-status=live |archive-url=https://web.archive.org/web/20170427035659/https://www.heise.de/newsticker/meldung/Archivierung-des-Internets-Internet-Archive-ignoriert-kuenftig-robots-txt-3693558.html |archive-date=April 27, 2017}}</ref><ref>{{cite web |title=Suchmaschinen: Internet Archive will künftig Robots.txt-Einträge ignorieren – Golem.de |url=https://www.golem.de/news/suchmaschinen-internet-archive-will-kuenftig-robots-txt-eintraege-ignorieren-1704-127446.html |access-date=May 14, 2017 |language=de |url-status=live |archive-url=https://web.archive.org/web/20170619210648/https://www.golem.de/news/suchmaschinen-internet-archive-will-kuenftig-robots-txt-eintraege-ignorieren-1704-127446.html |archive-date=June 19, 2017}}</ref><ref>{{cite news |title=Internet Archive will ignore robots.txt files to keep historical record accurate |url=https://www.digitaltrends.com/computing/internet-archive-robots-txt/ |newspaper=Digital Trends |access-date=May 14, 2017 |date=April 24, 2017 |url-status=live |archive-url=https://web.archive.org/web/20170516130029/https://www.digitaltrends.com/computing/internet-archive-robots-txt/ |archive-date=May 16, 2017}}</ref>
 
==Uses==