Originally posted by kel
The HITS algorithm itself is only a high-level description of what goes on; there is a lot more to actually implementing a search algorithm that can accurately span the content of the web.
Really? I found it quite usable. If you google for it, you can get all the maths.
It's not very complex if you have a good matrix library. You just need to be able to take the transpose of the base set's adjacency matrix and repeatedly multiply by it and the original, which amounts to raising their product to a power (usually ~10).
Remember, the base set is usually small (only 500 or so pages), plus their links (2,000 pages in total), so you're working on a 2,000 x 2,000 matrix, and its transpose.
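To make the iteration concrete, here is a minimal sketch in plain Python with no matrix library. It assumes the link graph fits in memory as a dict of outgoing links; the function name `hits` and the default of 10 iterations are my own choices for illustration. Adding up hub scores along incoming links is the same thing as multiplying by the transpose.

```python
def hits(links, iterations=10):
    """Rough HITS sketch. links maps each page to the pages it links to."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # Authority update: sum the hub scores of pages linking in (a = A^T h).
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                auth[dst] += hub[src]

        # Hub update: sum the authority scores of pages linked to (h = A a).
        hub = {p: sum(auth[d] for d in links.get(p, ())) for p in pages}

        # Normalise both vectors so the scores do not grow without bound.
        for scores in (auth, hub):
            norm = sum(v * v for v in scores.values()) ** 0.5 or 1.0
            for p in scores:
                scores[p] /= norm

    return hub, auth
```

Calling `hits({"a": ["c"], "b": ["c"], "c": ["d"]})` returns two dicts of scores, with "c" as the strongest authority because two pages point at it.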
Standard (meaning off-the-shelf) relational databases won't work because they don't store and index the data properly. They won't work as a low-level store because they can't find and read information fast enough, so you can't build an engine on top of them that spans the WWW.
If you're going for billions of pages, sure. But for a DIY system, I would recommend SQL: you avoid needing to calculate an inverse matrix, for one, and it's likely to be fast enough up to ~100,000 pages or so.
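As a rough illustration of the SQL route, here is a sketch using Python's built-in sqlite3 module. The table names (`link`, `hub`, `auth`) and the function `hits_sql` are made up for this example. With links stored as rows, following a link backwards instead of forwards is just a matter of which column the join uses, so no separate matrix ever gets built.

```python
import sqlite3

def hits_sql(edges, iterations=10):
    """HITS sketch over a link table. edges is a list of (src, dst) pairs."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE link (src TEXT, dst TEXT);
        CREATE TABLE hub  (page TEXT PRIMARY KEY, val REAL);
        CREATE TABLE auth (page TEXT PRIMARY KEY, val REAL);
    """)
    con.executemany("INSERT INTO link VALUES (?, ?)", edges)
    # Every page that appears on either end of a link starts with score 1.
    for table in ("hub", "auth"):
        con.execute(f"""
            INSERT INTO {table}
            SELECT page, 1.0
            FROM (SELECT src AS page FROM link UNION SELECT dst FROM link)
        """)

    def normalise(table):
        total = con.execute(f"SELECT SUM(val * val) FROM {table}").fetchone()[0]
        norm = (total or 0.0) ** 0.5 or 1.0
        con.execute(f"UPDATE {table} SET val = val / ?", (norm,))

    for _ in range(iterations):
        # Authority: sum of hub scores over incoming links (join on dst).
        con.execute("""
            UPDATE auth SET val = (
                SELECT COALESCE(SUM(h.val), 0.0)
                FROM link l JOIN hub h ON h.page = l.src
                WHERE l.dst = auth.page)
        """)
        # Hub: sum of authority scores over outgoing links (join on src).
        con.execute("""
            UPDATE hub SET val = (
                SELECT COALESCE(SUM(a.val), 0.0)
                FROM link l JOIN auth a ON a.page = l.dst
                WHERE l.src = hub.page)
        """)
        normalise("auth")
        normalise("hub")

    hubs = dict(con.execute("SELECT page, val FROM hub"))
    auths = dict(con.execute("SELECT page, val FROM auth"))
    con.close()
    return hubs, auths
```

For anything beyond a toy graph you would want indexes on `link(src)` and `link(dst)`; I have left them out to keep the sketch short.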
You COULD build your own relational database that has the low-level design to access large amounts of information in less than a second, and interact with it through an SQL query engine. But it would be somewhat pointless.
Well, you might wish to consider the file format the Google people outline. I am sure the details are available on google.
SQL is a sledgehammer when it comes to the relatively repetitive accesses google has to perform to complete a search.
Purely relational databases have many advantages, particularly if you decide not to store page caches, or just use a pointer to a filename in a row for the cache.
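By "a pointer to a filename in a row" I mean something like the sketch below, again using sqlite3. The schema and the helper names (`make_db`, `add_page`, `cache_path_for`) are hypothetical; the point is only that the row holds a path, while the cached page itself lives on disk outside the database.

```python
import sqlite3

def make_db():
    """Create an in-memory database with one row per crawled page."""
    con = sqlite3.connect(":memory:")
    con.execute("""
        CREATE TABLE page (
            id         INTEGER PRIMARY KEY,
            url        TEXT UNIQUE NOT NULL,
            cache_path TEXT              -- NULL when no cached copy is kept
        )
    """)
    return con

def add_page(con, url, cache_path=None):
    """Record a page and, optionally, where its cached copy is stored."""
    cur = con.execute(
        "INSERT INTO page (url, cache_path) VALUES (?, ?)", (url, cache_path)
    )
    return cur.lastrowid

def cache_path_for(con, url):
    """Return the cache file path for a URL, or None if there is none."""
    row = con.execute(
        "SELECT cache_path FROM page WHERE url = ?", (url,)
    ).fetchone()
    return row[0] if row else None
```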
I think you make a good point from the perspective of a full-scale Google implementation, but it is still practical to put together Google-like systems without 6,000-odd PCs and a team of full-time replacers.