Tilted Forum Project Discussion Community - View Single Post

kel · 04-08-2004, 08:33 PM

No, SQL can't be used as the basis for a search engine that spans the web.

The details of Google's design aren't public. You can learn vague bits of information from papers published over the past few years, but not enough to implement it.

SQL is the query language of relational databases, and a relational database is not intended for indexing something as diverse as the WWW. In any relational database the schema must be defined in advance. The tradeoff is that a schema flexible enough for web search would make each search too expensive to perform.
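One way to picture that tradeoff is the rough sketch below. It uses Python's built-in sqlite3 module and a made-up, fully generic `postings(url, word)` table; the table, the `pages_with_all` helper, and the three example pages are all assumptions for illustration, not anything Google uses. The point it shows is that every extra search term adds another self-join, which is cheap on three pages and not on the whole web.

```python
import sqlite3

# Hypothetical "flexible" schema: one row per (page, word) pair.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE postings (url TEXT, word TEXT)")

# Made-up pages, for illustration only.
pages = {
    "a.example": "cheap flights to paris",
    "b.example": "cheap hotels in rome",
    "c.example": "flights and cheap hotels",
}
for url, text in pages.items():
    rows = [(url, word) for word in set(text.split())]
    conn.executemany("INSERT INTO postings VALUES (?, ?)", rows)


def pages_with_all(words):
    """Return URLs containing every word; one self-join per extra word."""
    if not words:
        return []
    sql = "SELECT p0.url FROM postings AS p0"
    for i in range(1, len(words)):
        sql += f" JOIN postings AS p{i} ON p{i}.url = p0.url"
    sql += " WHERE " + " AND ".join(f"p{i}.word = ?" for i in range(len(words)))
    sql += " ORDER BY p0.url"
    return [row[0] for row in conn.execute(sql, list(words))]


two_terms = pages_with_all(["cheap", "flights"])
three_terms = pages_with_all(["cheap", "hotels", "rome"])
```

A three-word search already joins the table against itself twice. Scale the table up to every word on every page on the web and that is the expense I mean.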

Factoid: a Google search references 100+ megabytes of data, and the block (chunk) size in the Google File System is 64 megabytes. Google runs on around 15,000 commodity-class PCs spread out in clusters (200+ per cluster) around the world. In every Google cluster, every piece of information is mirrored no fewer than three times.
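For a feel of what those numbers imply, here is some back-of-envelope arithmetic in Python. It only uses the figures quoted above and treats them as exact, which they are not ("around", "200+", "100+"), so read the results as rough bounds. The variable names are mine.

```python
import math

# Figures from the post, treated as exact for the sake of arithmetic.
TOTAL_PCS = 15_000
MIN_PCS_PER_CLUSTER = 200
BLOCK_MB = 64
DATA_PER_SEARCH_MB = 100
MIN_COPIES = 3

# With at least 200 machines per cluster there can be at most this many clusters.
max_clusters = TOTAL_PCS // MIN_PCS_PER_CLUSTER

# 100 MB of data cannot fit in fewer than this many 64 MB blocks.
min_blocks_per_search = math.ceil(DATA_PER_SEARCH_MB / BLOCK_MB)

# Raw disk consumed by one block once it is mirrored three times.
raw_mb_per_block = BLOCK_MB * MIN_COPIES

print(max_clusters, min_blocks_per_search, raw_mb_per_block)
```

So at most 75 clusters, at least 2 blocks touched per search, and at least 192 MB of raw disk behind every 64 MB block.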

Early search engines relied on word frequency and on keywords and metadata inserted by the authors of web pages. Authors trying to drive traffic to their sites would insert utter crap to boost their rankings.
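To see why that was so easy to game, here is a toy word-frequency scorer in Python. It is a deliberately naive sketch of the general idea, not the formula any real engine used; the function name and both example pages are made up.

```python
from collections import Counter


def term_frequency_score(query: str, page_text: str) -> float:
    """Fraction of the page's words that match any query term."""
    words = page_text.lower().split()
    if not words:
        return 0.0
    counts = Counter(words)
    hits = sum(counts[term] for term in query.lower().split())
    return hits / len(words)


# Hypothetical pages: one honest, one stuffed with repeated keywords.
honest = "we sell cheap flights to paris and rome"
stuffed = "cheap flights cheap flights cheap flights buy pills here"

scores = {
    "honest": term_frequency_score("cheap flights", honest),
    "stuffed": term_frequency_score("cheap flights", stuffed),
}
```

The stuffed page wins easily, even though it is the less useful result, because the author controls every input to the score.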

Google is advanced in that it ranks a page based on the traffic visiting it AND on all the sites that link to it. A site that is heavily linked to will, in general, have a higher rank. This technique has proven resistant to tampering by website authors. The exact implementation isn't publicly available, and I haven't seen anything on the subject in any of the journals I read.
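The link half of that idea can be sketched in a few lines of Python. This is a simplified, textbook-style iteration and not Google's actual implementation: it ignores traffic entirely, the 0.85 damping value, the iteration count, the `link_rank` name, and the four-page toy web are all assumptions for illustration.

```python
def link_rank(links, damping=0.85, iterations=50):
    """Iteratively score pages by who links to them.

    `links` maps each page to the list of pages it links to.
    Scores always sum to 1.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page gets a small baseline share regardless of links.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its score evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score over everyone.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical toy web: three pages link to "hub", and "hub" links back to "a".
toy_web = {"a": ["hub"], "b": ["hub"], "c": ["hub", "a"], "hub": ["a"]}
rank = link_rank(toy_web)
best = max(rank, key=rank.get)
```

Here "hub" comes out on top because the most pages link to it, and "b" and "c" sit at the bottom because nothing links to them. An author can stuff their own page with keywords, but they can't easily make other sites link to it, which is the tamper resistance.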

You can find an uninformative high-level overview at http://www.cs.ubc.ca/~krasic/cse585/brin98anatomy.pdf

http://www.google.com/technology/pigeonrank.html

Last edited by kel; 04-08-2004 at 08:36 PM.