04-08-2004, 08:13 PM | #1 (permalink) |
Upright
|
How to build a search engine?
Does anyone know the basics of how a general search engine works?
I'm curious as to whether a SQL database could be used for such a thing. For instance, say a simple search engine matches the keywords you type in using boolean AND (it matches all of the keywords). That part is simple enough to do in SQL, but the ranking of the results is where I get stumped. Google has a complex algorithm that ranks a page based on it's popularity: it ranks higher if more pages link to that site. What if I just wanted to make a somewhat search engine using a SQL database that would rank the results based on the number of times a keyword appears in the searched text. Does any have an idea how this would be accomplished? It seems difficult to find general theory about search engines on the internet. That might be because if someone comes up with a good idea, they patent the technology? Another question is, would your standard database server (Oracle, MySQL, SQL Server) be able to do text searches through millions of records? If not, what kind of technology would be used here? |
04-08-2004, 08:33 PM | #2 (permalink) |
WARNING: FLAMMABLE
Location: Ask Acetylene
|
No SQL can't be used as the basis for a search engine that spans the web.
The details of googles design isn't made public, you can learn vague bits of information from papers published over the past few years, but not enough to implement. SQL is a relational database. It is not intended for indexing diverse spans of information like the WWW. In any relational database the schema must be defined in advance. The tradeoff is that to make a schema flexible enough for a web search, it would be too expensive to perform a search. Factoid, a google search references 100+ megabytes of data, the smallest block size in the google filesystem is 64 megabytes. Google runs on around 15,000 commodity class PCs spread out in clusters (200+ per cluster) around the world. In every google cluster every piece of information is mirrored no less then three times. Early search engines relied on word frequency and keyword and meta data inserted by the authors of web pages. Authors trying to drive traffic to their sites would insert utter crap to boost their rankings. Google is advanced in that it accounts for a pages rank based on the traffic visiting it, AND based on all the sites that link to it. A site that is heavily linked to will have a higher rank in general. This technique has proven resistant to tampering with by website authors. The exact implementation isn't publicly available and I haven't read anything on the subject in any of the journals I read. You can find an uninformative high level over view at http://www.cs.ubc.ca/~krasic/cse585/brin98anatomy.pdf http://www.google.com/technology/pigeonrank.html
__________________
"It better be funny" Last edited by kel; 04-08-2004 at 08:36 PM.. |
04-10-2004, 01:58 AM | #3 (permalink) |
Once upon a time...
|
Sorry kel, I disagreee.
1) The basic HITS algorithm is available. It's based on locating the adjacency matrix of a base set of pages. If you google for HITS algorithm, there are several pages with detailed explanations of the code. Obvioulsy, it's not exactly what Google does, but it is the same basic algorithm. 2) I don't think that the relational schema is an issue, since HITS and other algorithms are based on well defined data. The relational nature is actually an asssitance, since you can do a lot of the work in-situ. tuckdiddy, you should try one of these books: Modern Information Retrieval Ricardo Baeza-Yates, Berthier Ribeiro-Neto, Addison Wesley, 1999. Data Mining: Concepts and Techniques Jiawei Han and Micheline Kamber, Morgan Kaufmann, 2000. They have useful info on building search engines.
__________________
-- Man Alone ======= Abstainer: a weak person who yields to the temptation of denying himself a pleasure. Ambrose Bierce, The Devil's Dictionary. |
04-10-2004, 01:38 PM | #4 (permalink) |
Devils Cabana Boy
Location: Central Coast CA
|
i belive that google has cache of the web (or of alot of the websites)
i think that is how they do it.
__________________
Donate Blood! "Love is not finding the perfect person, but learning to see an imperfect person perfectly." -Sam Keen |
04-10-2004, 08:02 PM | #5 (permalink) |
WARNING: FLAMMABLE
Location: Ask Acetylene
|
The hits algorithm itself is only a high level description of what goes on, there is alot more to actually implementing a search algorithm that can accurately span the content of the web.
Standard (meaning off the shelf) relational databases won't work because they don't store and index properly. They won't work as a low level store because they can't find and read information fast enough, so you can't build an engine on top of it that spans the WWW. You COULD build your own relational database that has the low level design that can access the large amounts of information in less then a second and interact it with it through an SQL query engine. But it would be somewhat pointless. SQL is a sledgehammer when it comes to the relatively repetitive accesses google has to perform to complete a search.
__________________
"It better be funny" Last edited by kel; 04-10-2004 at 08:18 PM.. |
04-11-2004, 06:08 PM | #7 (permalink) | ||||
Once upon a time...
|
Quote:
It's not very complex if you have a good matrix library. You just need to be able to repeatedly get the Transpose of the set adjacency matrix and raise it to a power (usually ~10). Remember, the base set is usually small (only 500 or so pages), plus their links (2,000 pages in total), so you're working on a 2,000 x 2,000 row matrix, and its transpose. Quote:
Quote:
Quote:
I think you make a good point from the perspective of a full-scale google implementation, but I think that it is still practical to put together google-like systems, without 6,000-odd pcs and a team of full-time replacers
__________________
-- Man Alone ======= Abstainer: a weak person who yields to the temptation of denying himself a pleasure. Ambrose Bierce, The Devil's Dictionary. |
||||
04-14-2004, 03:20 PM | #9 (permalink) | ||
Wehret Den Anfängen!
Location: Ontario, Canada
|
You can buy a google-box from google for intranet use. =)
Btw, manabone, based off the initial post: Quote:
Quote:
__________________
Last edited by JHVH : 10-29-4004 BC at 09:00 PM. Reason: Time for a rest. |
||
04-15-2004, 03:47 PM | #10 (permalink) | |
Once upon a time...
|
Quote:
And MySql can store gigabyte cells. and it's pronounced 'manalone'
__________________
-- Man Alone ======= Abstainer: a weak person who yields to the temptation of denying himself a pleasure. Ambrose Bierce, The Devil's Dictionary. |
|
04-15-2004, 07:44 PM | #11 (permalink) | |
undead
Location: nihilistic freedom
|
Quote:
|
|
Tags |
build, engine, search |
|
|