BackRub is a "web crawler" designed to traverse the web.

Currently we are developing techniques to improve web search engines. We will make various services available as soon as possible.

Sorry, many services are unavailable due to a local network failure beyond our control. We are working to fix the problem and hope to be back up soon. 12/4/97

We have a demo that searches the titles of over 16 million URLs: BackRub title search demo [http://zam.stanford.edu:2405/]

BackRub search with comparison (type in top box, ignore cgi-bin error) [http://backrub.stanford.edu/cgi-bin/mq] New systems will be coming soon.
Some documentation from a talk about the system is here [http://www-diglib.stanford.edu/cgi-bin/WP/get/SIDL-WP-1997-0072].

BackRub is a research project of the Digital Library Project [http://diglib.stanford.edu/] in the Computer Science Department [http://www-cs.stanford.edu/] at Stanford University [http://www.stanford.edu/].

Some Rough Statistics (from August 29th, 1996)
Total indexable HTML URLs: 75.2306 Million
Total content downloaded: 207.022 gigabytes
Total indexable HTML pages downloaded: 30.6255 Million
Total indexable HTML pages which have not been attempted yet: 30.6822 Million
Total robots.txt excluded: 0.224249 Million
Total socket or connection errors: 1.31841 Million

BackRub is written in Java and Python and runs on several Sun Ultras and Intel Pentiums running Linux. The primary database is kept on a Sun Ultra II with 28GB of disk. Scott Hassan [http://www-db.stanford.edu/~hassan/] and Alan Steremberg [http://www-cs-students.stanford.edu/~alans/] have provided a great deal of very talented implementation help. Sergey Brin [http://www-db.stanford.edu/~sergey/] has also been very involved and deserves many thanks.

Before emailing, please read the FAQ [http://backrub.stanford.edu/FAQ.html]. Thanks.

-Larry Page [http://www-pcd.stanford.edu/~page/] page@cs.stanford.edu


BackRub Frequently Asked Questions

If your question is not answered here, please email backrub@pcd.stanford.edu or, if you prefer, call (415) 723-3154 and ask for Larry.

1) Why is BackRub asking for a file called robots.txt which isn't on my server?

This is a document which can tell BackRub not to download some or all information from your web server. For information on how to create a robots.txt file, see http://info.webcrawler.com/mak/projects/robots/norobots.html.

2) I don't want BackRub visiting my site or part of my site.

There is a standard for robot exclusion at http://info.webcrawler.com/mak/projects/robots/norobots.html. You can put a file on your server called robots.txt which can exclude BackRub or other "web crawlers". BackRub has a user-agent of "BackRub".
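
For example, a robots.txt file like the following, placed at the top level of your server, would keep BackRub (but not other crawlers) out of everything under a /private/ directory; the directory name here is only an illustration:

    User-agent: BackRub
    Disallow: /private/

To exclude BackRub from your entire server, use "Disallow: /" instead.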

3) Why is BackRub trying to download incorrect links from my server, or from a server that doesn't exist?

It is a property of the web that many links will be broken or outdated at any given time. Whenever anyone in the world mistypes a link that points to your site, or fails to update their pages to reflect changes on your server, BackRub will try to download that incorrect link from your site. This is also why you may get hits on a machine that is not even a web server.

4) Why is BackRub downloading information from our "Secret" web server?

It is almost impossible to keep a web server secret by not publishing any links to it. As soon as someone follows a link from your "secret" server to another web server, your "secret" URL is likely to appear in the Referer header, where it can be stored and possibly published by the other web server in its referer log. So, if there is a link to your "secret" web server or page anywhere on the web, it is likely that BackRub and other "web crawlers" will find it.
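
As a purely hypothetical illustration (both host names here are made up), when a browser follows a link from a page on your "secret" server to another site, the request it sends can look like this, exposing the secret URL in the Referer header:

    GET /linked-page.html HTTP/1.0
    Referer: http://secret.example-company.com/private/plans.html

The other server's administrator can then find your URL simply by reading their referer log.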

5) I have a robots.txt file. Why isn't BackRub obeying it?

In order to save bandwidth, BackRub only downloads the robots.txt file every week or so. So it may take a while for BackRub to learn of any changes that might have been made to your robots.txt file. Also, BackRub is distributed across several machines, and each of these keeps its own record of your robots.txt file. Finally, check that your syntax is correct against the standard at: http://info.webcrawler.com/mak/projects/robots/norobots.html.
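
If you want to sanity-check what a crawler will conclude from your file, here is a rough sketch using Python's standard urllib.robotparser module; the URLs are placeholders and this is only an illustration, not part of BackRub itself:

    # Sketch: check what a robots.txt file allows for the "BackRub" user-agent.
    # The URLs below are placeholders; substitute your own server.
    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("http://www.example.com/robots.txt")
    rp.read()  # fetch and parse robots.txt

    # True if a crawler identifying itself as "BackRub" may fetch this page
    print(rp.can_fetch("BackRub", "http://www.example.com/private/page.html"))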

6) How do I register my site with BackRub so it will be indexed?

There is no way to register; BackRub will find your site eventually.

7) Why are there hits from multiple machines at .stanford.edu all with user-agent BackRub?

BackRub was designed to be distributed across several machines in order to improve performance and to scale as the web grows. Also, to cut down on bandwidth usage, we would like to run many crawlers on machines that are close in the network to the sites they are indexing.

8) Your logo is upside down: Why is the light source obviously below the image? It looks quite unnatural...

The logo is simply a scan of my hand, from a flatbed scanner converted to black and white. The "back" in the picture is the scanner cover, and the shadows are from the scanner light.

For more answers, see the Robots FAQ.


Copyright 1997