About Warrick

"My old web hosting company lost my site in its entirety (duh!) when a hard drive died on them. Needless to say that I was peeved, but I do notice that it is available to browse on the wayback machine... Does anyone have any ideas if I can download my full site?" - A request for help at archive.org


Warrick is a utility for reconstructing or recovering a website when a backup is not available. Warrick searches the Internet Archive, Google, Live Search, and Yahoo for stored pages and images and saves them to your filesystem. Warrick can be run through our website or as a command-line utility (directions for downloading, installing, and running are given below).

Warrick is most effective at finding cached content in search engines in the first several days after losing the website since the cached versions of pages tend to disappear once the search engine re-crawls your site and can no longer find the pages. Running Warrick multiple times over a period of several days or weeks can increase the number of recovered files because the caches fluctuate daily (especially Yahoo's). Internet Archive's repository is at least 6-12 months out of date, and therefore you will only find content from them if your website has been around at least that long. If they don't have your website archived, you might want to run Warrick again in 6-12 months.

Warrick is named after a fictional forensic scientist with a penchant for gambling. It was built as part of a research project in 2005 by Frank McCown, a Ph.D. student at Old Dominion University. You can read about Warrick and our experiments reconstructing websites here.

If you would like to cite Warrick in your academic publication, please cite the following:

Frank McCown, Joan A. Smith, Michael L. Nelson, and Johan Bollen, Lazy Preservation: Reconstructing Websites by Crawling the Crawlers, Proceedings of the 8th ACM International Workshop on Web Information and Data Management (WIDM 2006), pp. 67-74, 2006.


Quick links:

  1. How It Works
  2. Downloading
  3. Installing
  4. Running
    1. Basic Operation
    2. Using Specific Web Repositories
    3. Internet Archive
    4. Google
  5. Viewing Reconstructions
  6. Donations
  7. Future Enhancements
  8. Disclaimer


How It Works

Warrick uses the following web repositories when searching for resources to recover:

Web repository    Request method        Requests per 24 hours
Internet Archive  Page scraping         1,000
Google            Page scraping or API  1,000
Yahoo             Yahoo API             5,000
Live Search       Live Search Web API   10,000

Earlier versions of Warrick used the Google API, but since Google no longer provides API keys to new users, Warrick now scrapes results by default. We limit Warrick to 1,000 queries per day to Google and the Internet Archive so Warrick does not become a burden to them.

Warrick is first given a seed URL, the base URL (e.g., http://www.foo.edu/~joe/) for the website which should be reconstructed. This URL is added to the URL queue for recovery.

Warrick first queries all four web repositories, asking what URLs each has indexed or cached for the particular site. The search engines are queried using the "site:" parameter (and "inurl:" or "allinurl:" for subsites). Each of the cached URLs is extracted so we can use it to access the cached resource later.

Warrick then runs through the URL queue and queries each of the web repositories for the cached or archived version of the resource. Some resources require multiple requests to the repositories if the cached URL has not already been retrieved.

IA stores all resources in their native (canonical) format. Search engines only store HTML resources in their canonical format. Other resources are usually altered in some way. PDF, PostScript, Word, Excel, and PowerPoint files are converted into HTML for caching. For example, the PDF http://www.cs.odu.edu/~mln/cv.pdf is stored as HTML by Google here: http://search.google.com/search?q=cache:http://www.cs.odu.edu/~mln/cv.pdf, but a version from 2004 is stored in the IA here: http://web.archive.org/web/20040328121537/http://www.cs.odu.edu/~mln/cv.pdf.

When Warrick is recovering a URL that looks like it points to an HTML resource (it ends with /, .htm, .html, .php, .jsp, or .asp), it first asks the search engines if they have the resource cached. If it cannot be found in any of the search engines, IA is then queried.
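The suffix check described above can be sketched in a few lines of Python. This is only an illustration, not Warrick's own Perl code: the function names are made up, the match is case-sensitive because the text does not say otherwise, and the order among the three search engines is an assumption (the text only says that IA is asked last).

```python
# Suffixes that, per the description above, mark a URL as "probably HTML".
HTML_SUFFIXES = ("/", ".htm", ".html", ".php", ".jsp", ".asp")

# Assumed order among the search engines; only "IA last" comes from the text.
SEARCH_ENGINES = ["google", "live search", "yahoo"]


def looks_like_html(url: str) -> bool:
    """Return True if the URL ends with one of the listed suffixes."""
    return url.endswith(HTML_SUFFIXES)


def html_lookup_order() -> list:
    """Repositories to try for an HTML-looking URL: search engines, then IA."""
    return SEARCH_ENGINES + ["ia"]
```

Note that a URL such as index.php?nav=1 does not end with .php, so this literal reading of the rule would not treat it as HTML.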

Each time an HTML resource is recovered, it is parsed for links to other resources, and the links are added to the URL queue. Only URLs that are in and beneath the seed URL are recovered. So if the seed URL is http://www.foo.edu/~joe/, only URLs matching http://www.foo.edu/~joe/* are recovered.
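As a rough sketch of this step (in Python rather than Warrick's Perl, with hypothetical names throughout), the code below pulls href and src attributes out of a recovered page, resolves them against the page's URL, and queues only those at or beneath the seed URL. Dropping #fragments and tracking a "seen" set are assumptions added to keep the queue free of duplicates.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collect the raw values of href and src attributes."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def in_scope(url, seed):
    """True for URLs matching seed* (in and beneath the seed URL)."""
    return url.startswith(seed)


def enqueue_links(html, page_url, seed, queue, seen):
    """Parse one recovered page and append its new in-scope links to queue."""
    parser = LinkExtractor()
    parser.feed(html)
    for link in parser.links:
        # Resolve relative links and drop any #fragment (an assumption).
        absolute, _fragment = urldefrag(urljoin(page_url, link))
        if in_scope(absolute, seed) and absolute not in seen:
            seen.add(absolute)
            queue.append(absolute)
```

With the seed http://www.foo.edu/~joe/, a link to http://www.foo.edu/~bob/ is skipped, while cv.html and images/hello.gif are queued.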

Warrick saves the recovered resources to disk. If a resource is found in more than one web repository, Warrick saves the resource with the most recent date. Some resources, especially images or PDFs, will not have a date associated with them. If the resource is a PDF, PostScript, Word document, or other non-HTML format, then Warrick will choose the IA (canonical) version over the HTML-version of the resource, regardless of its age. This behavior can be changed using a command-line parameter.
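The selection rule can be sketched as follows. This is an illustrative Python version, not the actual implementation: the Copy record and choose_copy function are hypothetical, and treating an undated copy as older than any dated one is an assumption, since the text does not say how missing dates are ranked.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Copy:
    repo: str                # e.g. "ia", "google", "yahoo"
    stored: Optional[date]   # None when no date is associated with the copy
    canonical: bool          # True if the copy is in the original format


def choose_copy(copies: List[Copy], is_html: bool) -> Optional[Copy]:
    """Pick the copy to save from those found for one URL."""
    if not copies:
        return None
    if not is_html:
        # Non-HTML resources: prefer a canonical copy regardless of its age.
        canonical = [c for c in copies if c.canonical]
        if canonical:
            copies = canonical
    # Otherwise the most recent copy wins; undated copies rank oldest (assumed).
    return max(copies, key=lambda c: c.stored or date.min)
```

For a PDF found both in IA (2004, canonical) and in Google (2005, converted to HTML), this picks the IA copy even though it is older.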

A reconstruction summary file is created that lists the URLs that were successfully and unsuccessfully recovered. Here's an example:

timestamp            orig url                              mime type        filename                 web repo  stored date
2005-11-17 10:23:04  http://foo.edu/~joe/                  text/html        foo.edu/~joe/index.html  yahoo     2005-11-16
2005-11-17 10:23:08  http://foo.edu/~joe/images/hello.gif  MISSING
2005-11-17 10:23:13  http://foo.edu/~joe/resume.pdf        application/pdf  foo.edu/~joe/resume.pdf  google    2005-11-09
2005-11-17 10:23:29  http://foo.edu/~joe/styles.css        text/css         foo.edu/~joe/styles.css  ia        2005-02-14
etc...

The name of the summary file depends on the URL you used to start the reconstruction. If you used http://www.foo.edu/~joe/, the file will be named www.foo.edu.joe_reconstruct_log.txt.

Notice in the summary file that the file names match the URLs of the resources. If you run Warrick on Windows, the file names may not match the original URLs. Windows won't allow \, |, /, :, ?, ", *, <, > in the filename, so these characters are escaped or converted to acceptable characters.
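The exact escaping Warrick applies is not spelled out here, so the following Python sketch only illustrates one plausible scheme: percent-encoding each forbidden character in a single file or directory name. The function name and the choice of percent-encoding are assumptions, not a description of Warrick's behavior.

```python
# Characters Windows rejects in a file name, as listed above.
WINDOWS_FORBIDDEN = '\\|/:?"*<>'


def windows_safe_name(segment: str) -> str:
    """Percent-encode forbidden characters in one path segment.

    '%' is encoded as well so that the mapping stays reversible.
    """
    return "".join(
        f"%{ord(ch):02X}" if ch in WINDOWS_FORBIDDEN + "%" else ch
        for ch in segment
    )
```

The function is meant to be applied to each path segment separately, since the slashes between segments become directory separators on disk.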

Warrick will continue to recover resources until the URL queue is empty. If the allotted number of daily queries is used up and the URL queue is not yet empty, Warrick will put itself to sleep for 24 hours. When it awakens, the number of used queries will be set back to 0, and the reconstruction will continue.
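A minimal Python sketch of this quota-and-sleep behavior is shown below. The class and its names are hypothetical; the limits come from the table in this section, and the sleep function is injectable only so the logic can be exercised without actually waiting a day.

```python
import time

# Daily limits taken from the repository table in this section.
DAILY_LIMITS = {"ia": 1000, "google": 1000, "yahoo": 5000, "live search": 10000}

SECONDS_PER_DAY = 24 * 60 * 60


class QueryBudget:
    """Count queries to one repository and pause when the daily limit is hit."""

    def __init__(self, limit, sleep=time.sleep):
        self.limit = limit
        self.used = 0
        self._sleep = sleep

    def spend(self):
        """Call before each query; blocks for 24 hours once the limit is used up."""
        if self.used >= self.limit:
            self._sleep(SECONDS_PER_DAY)
            self.used = 0  # counter starts over after the sleep
        self.used += 1
```

A crawler loop would call budget.spend() immediately before every request to the corresponding repository.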

Note: Warrick cannot recover web pages that were never crawled and cached. Therefore pages that are not accessible to search engines (protected by robots.txt or passwords, residing in the deep web, or only accessible through Flash or JavaScript) are not accessible to Warrick. Also, Warrick cannot reconstruct the server-side components or logic (CGI programs, scripts, databases, etc.) of a website. That means if the bar.php resource is recovered, it will be the client's version of the page, not the file with the PHP code inside.


Downloading

Warrick is available for download here:

    Warrick version 1.7.4 - warrick-1.7.4.zip - Windows           Created: July 29, 2007
    Warrick version 1.7.4 - warrick-1.7.4.tar.gz - Linux/Unix     Created: July 29, 2007

NOTE: If you use Warrick to reconstruct a website that you have lost, please send me an email letting me know: fmccown at cs dot odu dot edu. We are very interested in keeping a log of websites that have been recovered with Warrick.

Warrick is licensed under the GNU General Public License. Warrick is revised frequently and will be updated periodically, so make sure you are always running the most recent version.


Installing

Unzip or untar the file in a directory, say c:\warrick or ~/warrick. You may then need to add warrick.pl to your path or just cd to the directory where you installed it.

Warrick has been tested on a Unix platform and on Windows (using ActivePerl). It was written in Perl, so you need Perl 5 installed. You may also need to install the SOAP::Lite CPAN module if it's not already installed.

Here are detailed instructions for installing Perl and SOAP-Lite and running Warrick on Windows.


Running

Warrick must be run from the command line. It shares many command-line parameters with Wget, a popular open source web crawler. We are currently working on a web-based interface that will allow you to run Warrick without downloading and installing it. The interface will likely be available in early 2007.

When reconstructing very large websites, Warrick may run for several days. After Warrick uses up all its daily queries, it will sleep for 24 hours before resuming where it left off. Website reconstruction is a slow process.

If you have problems running Warrick, you may email me at fmccown at cs dot odu dot edu.

Basic Operation

In order to recover a single page (for example, this one) and nothing else, you could run Warrick like so:

warrick.pl http://www.cs.odu.edu/~fmccown/research/lazy/warrick.html

In order to reconstruct an entire website, you need to use the -r switch. Suppose the website foo.edu/~joe/ was suddenly lost. This is how to run Warrick to reconstruct the entire site using optional parameters -d and -o (quotes around the URL are usually only necessary when it contains the "&" character):

warrick.pl -r -d -o warrick_log_foo.txt "http://foo.edu/~joe/"

-r : Recursively fetch more URLs
-d : Turn on debug output
-o : Put all warrick output in a log file

A reconstruction summary file called foo.edu.joe_reconstruct_log.txt would be created listing the URLs that were recovered. Because Warrick could run for more than 24 hours, you may want to run it as a background process (adding a & to the end of the command in Linux/Unix).

If you want to run Warrick again on subsequent days to find additional files, you would use the "no clobber" option (-nc) so the files already recovered would not be downloaded again. The downloaded files would be re-processed and parsed for links to missing resources. This is how you might run Warrick using the no clobber option:

warrick.pl -r -d -nc -o warrick_log_foo.txt "http://foo.edu/~joe/"

This would append the new findings onto the reconstruction summary file from the previous run.


If you would like to retrieve every resource stored in all four web repositories, use the "complete recovery" (-c) option. This is useful when a website has been lost and the root page that is cached no longer contains links to the rest of the website.

warrick.pl -r -c "http://foo.edu/~joe/"


If you would like Warrick to ignore the case of the URLs it recovers, use the -ic (ignore-case) option. This is very useful when reconstructing websites that were housed on a Windows server. The Windows filesystem is case-insensitive, so the URLs http://foo.org/bar and http://foo.org/BAR refer to the same resource on a Windows web server. Google may have this URL stored one way and Yahoo another. By default, Warrick will treat these as separate URLs although they really refer to the same resource. If the -ic option is used, Warrick will treat these URLs as one and the same. Example:
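The effect of ignoring case can be illustrated with a small Python sketch. The helper names are hypothetical, and keeping the first spelling seen is an assumption; the point is only that the comparison key is lowercased when the option is on.

```python
def url_key(url: str, ignore_case: bool) -> str:
    """Key used to decide whether two URLs name the same resource."""
    return url.lower() if ignore_case else url


def merge_urls(urls, ignore_case=False):
    """Drop duplicate URLs, keeping the first spelling encountered."""
    seen = {}
    for url in urls:
        seen.setdefault(url_key(url, ignore_case), url)
    return list(seen.values())
```

With ignore_case=True, http://foo.org/bar and http://foo.org/BAR collapse into a single queue entry; without it, both are kept.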

warrick.pl -r -ic http://foo.edu/~joe/


Using Specific Web Repositories

To reconstruct an entire website using only a subset of the web repositories, use the -wr option with a comma-separated list of web repositories to use. Use the following abbreviations: g=Google, ia=Internet Archive, ls=Live Search, y=Yahoo. The following example will reconstruct the website using only Google and the Internet Archive:

warrick.pl -r -wr g,ia http://foo.org/

There are several options that pertain to only a specific web repository.

Internet Archive

If you want to only recover resources from a particular year from the Internet Archive, use the -dr option and specify the year. For example, to only recover resources archived in 2003:

warrick.pl -r -wr ia -dr 2003 http://foo.org/

You can see what pages from your website are stored in the Internet Archive from a particular year like this:

http://web.archive.org/web/2003*/http://www.cs.odu.edu/*

You can also use -dr to recover resources from the Internet Archive for a specific date, like May 25, 2003:

warrick.pl -r -wr ia -dr 2003-05-25 http://www.cs.odu.edu/

To see what pages this would recover, query the Internet Archive like so:

http://web.archive.org/web/20030525*/http://www.cs.odu.edu/*

To recover only resources within a particular date range from the Internet Archive, use the -dr option and specify the begin and end dates (inclusive) in this format: yyyy-mm-dd separated by a colon. For example, to recover only resources archived from Feb 1, 2004 to Aug 31, 2005:

warrick.pl -r -wr ia -dr 2004-02-01:2005-08-31 http://www.cs.odu.edu/

You may also leave the date range open-ended. If you want only resources that were archived after Feb 1, 2004, you would use "-dr 2004-02-01:". If you wanted only resources that were archived before Aug 31, 2005, you would use "-dr :2005-08-31".
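To make the accepted -dr forms concrete, here is a Python sketch that parses each of them into an inclusive (begin, end) pair and builds the Internet Archive listing URL shown in the examples above. The function names are hypothetical, None stands for an open end, and this is not the parsing code Warrick itself uses.

```python
from datetime import date
from typing import Optional, Tuple


def parse_dr(value: str) -> Tuple[Optional[date], Optional[date]]:
    """Parse a -dr value: yyyy, yyyy-mm-dd, or begin:end with either side optional."""
    if ":" in value:
        begin, end = value.split(":", 1)
        return (
            date.fromisoformat(begin) if begin else None,
            date.fromisoformat(end) if end else None,
        )
    if len(value) == 4 and value.isdigit():
        year = int(value)
        return date(year, 1, 1), date(year, 12, 31)
    day = date.fromisoformat(value)
    return day, day


def ia_listing_url(dr_value: str, site: str) -> str:
    """Listing URL for a single year or single date, as in the examples above."""
    prefix = dr_value.replace("-", "")  # 2003-05-25 -> 20030525
    return f"http://web.archive.org/web/{prefix}*/{site}*"
```

For instance, parse_dr("2004-02-01:") yields a begin date with an open end, and ia_listing_url("2003", "http://www.cs.odu.edu/") reproduces the yearly listing URL given earlier.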

Google

By default, Warrick accesses cached pages by scraping results from www.google.com. Warrick can instead access Google using the Google API if you already have a key for their SOAP-based API. Keys for the AJAX API will not work. You'll need to place your key in the google_key.txt file (in the same directory as Warrick) before you run Warrick, and you'll need to use the -ga switch. Example:

warrick.pl -r -wr g -ga http://foo.org/

Be careful when running Warrick: Google monitors traffic through www.google.com, and if they suspect you are making automated requests, they will "blacklist" your IP address and will not respond to queries for as long as 12 hours. If Warrick detects that it has been blacklisted, it will sleep for 12 hours and then pick up where it left off. In my experiments, Google has detected me after about 100-150 requests. We cannot be held responsible if Google blacklists your IP address.


Viewing Reconstructions

After reconstructing a website, you may want to view the files that were recovered in your browser. You can open the files directly into your browser or double-click on them to launch the default application associated with the files. The default application is normally determined by the file's extension. If the file extension is .html, the browser is usually the default application. If the extension is .gif, a graphics application may be the default application.

In order to navigate the reconstructed website from your hard drive by clicking on links, you will likely need to convert absolute URLs to relative ones and rename some of the files. For example, if you are viewing a web page that has a link to http://foo.org/index.php?nav=1, clicking on the link will cause the browser to load the URL, not the index.php?nav=1 file on your hard drive. To view the actual file, the absolute link will need to be converted to a relative one, and the file extension may also need to be changed. Warrick can do this for you.

The -k option will convert all absolute URLs to relative URLs (without changing any file names). For example, the URL pointing to "http://foo.edu/~joe/car.html" will be converted to point to the car.html file you just recovered (e.g., "../car.html").
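The conversion itself is straightforward path arithmetic. The Python sketch below is an illustration with a hypothetical function name, not Warrick's code: it rewrites an absolute link relative to the directory of the page that contains it and leaves links to other hosts untouched (that last behavior is an assumption).

```python
import posixpath
from urllib.parse import urlsplit


def to_relative(link: str, page_url: str) -> str:
    """Rewrite an absolute link relative to the page that contains it."""
    link_parts = urlsplit(link)
    page_parts = urlsplit(page_url)
    if link_parts.netloc != page_parts.netloc:
        return link  # different host: leave the link as it is (assumed)
    page_dir = posixpath.dirname(page_parts.path) or "/"
    relative = posixpath.relpath(link_parts.path or "/", start=page_dir)
    if link_parts.query:
        # Keep the query string, since it is part of the saved file's name.
        relative += "?" + link_parts.query
    return relative
```

A link to http://foo.edu/~joe/car.html found in http://foo.edu/~joe/pics/index.html becomes ../car.html, matching the example above.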

Note that -k will not cause your website to be reconstructed again... it is just changing the recovered files on your hard drive. It is a good idea to create a backup of all the files you have recovered before running this option just in case.

Make sure that you use the same starting URL that you used when you reconstructed your website since this information is used to find the reconstruction summary file.

Example:

warrick.pl -k http://foo.org/


Use the -v option to make your reconstructed website completely browseable off-line. This option does three things: 1) converts all absolute links to relative links (just like the -k option), 2) appends ".html" to all file names of HTML resources that do not already have a .htm or .html extension, and 3) changes all links in all HTML resources to point to the newly renamed files. Additionally, if a file contains a question mark in the filename, the '?' character will be converted into a dash '-' so the file can be opened in the browser.

The -v option is also useful when you have recovered Word, PDF, Excel, etc. files that are actually in HTML format. For example, if Warrick recovered becky.pdf from Google, it would really be an HTML file since Google does not store PDFs in their canonical format. If you try to open becky.pdf in Adobe Acrobat, you'll get an error since the file is not in the PDF format. Using -v, Warrick would rename the file becky.pdf.html so the browser knows the file contains HTML. The -v option also changes all the links in recovered pages to point to becky.pdf.html instead of becky.pdf.

The -v option is also useful when recovering resources with URLs that use query strings (the '?' character). These types of resources are usually called dynamic pages. Although dynamic pages are often HTML, they do not have a ".html" extension on them, so loading them in a browser can be problematic. The -v option would rename foo?name=bob to foo-name=bob.html.
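The renaming rules for -v described above can be summarized in a short Python sketch. The function name is hypothetical, the caller is assumed to know whether the file's content is HTML, and the case-insensitive extension check is an assumption.

```python
def browsable_name(filename: str, is_html: bool) -> str:
    """Apply the -v renaming rules to one recovered file name."""
    # '?' is converted to '-' so the file can be opened in the browser.
    name = filename.replace("?", "-")
    # HTML content without a .htm/.html extension gets ".html" appended.
    if is_html and not name.lower().endswith((".htm", ".html")):
        name += ".html"
    return name
```

This reproduces the two examples in the text: becky.pdf (holding HTML) becomes becky.pdf.html, and foo?name=bob becomes foo-name=bob.html.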

Note that -v will not cause your website to be reconstructed again... it is just changing the recovered files on your hard drive. Make sure that you use the same starting URL that you used when you reconstructed your website since this information is used to find the reconstruction summary file.

Example:

warrick.pl -v http://foo.org/


Donations

If you are thankful for getting back your lost website and would like to make a donation (of any amount), please consider giving to the Internet Archive. They are a non-profit organization, so your donation is tax deductible. Plus it's easy to donate on-line using PayPal. If you do make a donation, please inform the Internet Archive that the donation was prompted by your use of Warrick.


Future Enhancements

The following are enhancements I am planning on making to Warrick. I do not yet have a time frame to make the enhancements.


Disclaimer

Warrick was developed as part of a research project at Old Dominion University. Neither Warrick nor any member of the research group is affiliated with Google, Microsoft, Yahoo, or the Internet Archive. Warrick is provided as is with no warranties or guarantees.



