Quote:
Originally posted by kel
The HITS algorithm itself is only a high-level description of what goes on; there is a lot more to actually implementing a search algorithm that can accurately span the content of the web.
|
Really? I found it quite usable. If you google for it, you can get all the maths.
It's not very complex if you have a good matrix library: the core is just repeated multiplication by the base set's adjacency matrix and its transpose (usually around 10 iterations).
Remember, the root set is usually small (only 500 or so pages), and adding the pages it links to gives a base set of around 2,000 pages in total, so you're working with a 2,000 x 2,000 matrix and its transpose.
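For instance, a rough sketch of that iteration in Python/NumPy might look like the following (the toy adjacency matrix, base-set size and iteration count are just illustrative assumptions on my part, not anything from the algorithm's authors):

[code]
import numpy as np

def hits(adj, iterations=10):
    """Return (hub, authority) scores; adj[i, j] == 1 means page i links to page j."""
    n = adj.shape[0]
    hubs = np.ones(n)
    auths = np.ones(n)
    for _ in range(iterations):
        auths = adj.T @ hubs           # authority = sum of hub scores of pages linking in
        auths /= np.linalg.norm(auths)
        hubs = adj @ auths             # hub = sum of authority scores of pages linked to
        hubs /= np.linalg.norm(hubs)
    return hubs, auths

# Toy base set: pages 0 and 1 both link to page 2, page 2 links to page 3.
A = np.zeros((4, 4))
A[0, 2] = A[1, 2] = A[2, 3] = 1.0
hubs, auths = hits(A)
[/code]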
Quote:
Standard (meaning off-the-shelf) relational databases won't work because they don't store and index properly. They won't work as a low-level store because they can't find and read information fast enough, so you can't build an engine on top of them that spans the WWW.
|
If you're going for billions of pages, sure. But for a DIY system I would recommend SQL: you avoid having to materialise the adjacency matrix and its transpose yourself, for one, and it's likely to be fast enough up to ~100,000 pages or so.
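As a sketch of what I mean, assuming SQLite and table/column names of my own invention, a small DIY crawl could be stored and queried like this:

[code]
import sqlite3

conn = sqlite3.connect("crawl.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS pages (
    page_id INTEGER PRIMARY KEY,
    url     TEXT UNIQUE NOT NULL,
    title   TEXT
);
CREATE TABLE IF NOT EXISTS links (
    from_id INTEGER NOT NULL REFERENCES pages(page_id),
    to_id   INTEGER NOT NULL REFERENCES pages(page_id),
    PRIMARY KEY (from_id, to_id)
);
-- an index on the other column covers the "transpose" direction
CREATE INDEX IF NOT EXISTS idx_links_to ON links(to_id);
""")

# Outgoing links of page 42 (one row of the adjacency matrix):
out_links = conn.execute("SELECT to_id FROM links WHERE from_id = ?", (42,)).fetchall()
# Incoming links of page 42 (one row of the transpose):
in_links = conn.execute("SELECT from_id FROM links WHERE to_id = ?", (42,)).fetchall()
[/code]

The database does the indexing for you, and both link directions are a single indexed query.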
Quote:
You COULD build your own relational database with a low-level design that can access that large amount of information in less than a second, and interact with it through an SQL query engine. But it would be somewhat pointless.
|
Well, you might wish to consider the repository file format the Google people outline in their paper. I am sure the details are available on Google.
Quote:
SQL is a sledgehammer when it comes to the relatively repetitive accesses Google has to perform to complete a search.
|
Purely relational databases have many advantages, particularly if you decide not to store page caches, or just store a pointer to a filename in a row for the cache, as in the sketch below.
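Something like this, say (again SQLite, with made-up names), keeps only a path in the row and leaves the raw HTML on disk:

[code]
import sqlite3
from pathlib import Path

conn = sqlite3.connect("crawl.db")
conn.execute("""
CREATE TABLE IF NOT EXISTS page_cache (
    page_id    INTEGER PRIMARY KEY,
    cache_path TEXT NOT NULL      -- e.g. 'cache/00/42.html'; the raw HTML lives on disk
)
""")

def cached_html(page_id):
    """Return the cached page body, or None if this page was never cached."""
    row = conn.execute(
        "SELECT cache_path FROM page_cache WHERE page_id = ?", (page_id,)
    ).fetchone()
    return Path(row[0]).read_text() if row else None
[/code]

That keeps the rows small, so the relational side only ever handles short, fixed-size-ish records.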
I think you make a good point from the perspective of a full-scale Google implementation, but it is still practical to put together Google-like systems without 6,000-odd PCs and a team of full-time replacers.