How to build a search engine? - Tilted Forum Project Discussion Community

tuckdiddy · 04-08-2004, 08:13 PM

Does anyone know the basics of how a general search engine works?

I'm curious as to whether a SQL database could be used for such a thing. For instance, say a simple search engine matches the keywords you type in using boolean AND (it matches all of the keywords). That part is simple enough to do in SQL, but the ranking of the results is where I get stumped.

Google has a complex algorithm that ranks a page based on it's popularity: it ranks higher if more pages link to that site.

What if I just wanted to make a somewhat search engine using a SQL database that would rank the results based on the number of times a keyword appears in the searched text.

Does any have an idea how this would be accomplished? It seems difficult to find general theory about search engines on the internet. That might be because if someone comes up with a good idea, they patent the technology?

Another question is, would your standard database server (Oracle, MySQL, SQL Server) be able to do text searches through millions of records? If not, what kind of technology would be used here?

kel · 04-08-2004, 08:33 PM

No SQL can't be used as the basis for a search engine that spans the web.

The details of googles design isn't made public, you can learn vague bits of information from papers published over the past few years, but not enough to implement.

SQL is a relational database. It is not intended for indexing diverse spans of information like the WWW. In any relational database the schema must be defined in advance. The tradeoff is that to make a schema flexible enough for a web search, it would be too expensive to perform a search.

Factoid, a google search references 100+ megabytes of data, the smallest block size in the google filesystem is 64 megabytes. Google runs on around 15,000 commodity class PCs spread out in clusters (200+ per cluster) around the world. In every google cluster every piece of information is mirrored no less then three times.

Early search engines relied on word frequency and keyword and meta data inserted by the authors of web pages. Authors trying to drive traffic to their sites would insert utter crap to boost their rankings.

Google is advanced in that it accounts for a pages rank based on the traffic visiting it, AND based on all the sites that link to it. A site that is heavily linked to will have a higher rank in general. This technique has proven resistant to tampering with by website authors. The exact implementation isn't publicly available and I haven't read anything on the subject in any of the journals I read.

You can find an uninformative high level over view at http://www.cs.ubc.ca/~krasic/cse585/brin98anatomy.pdf

http://www.google.com/technology/pigeonrank.html

manalone · 04-10-2004, 01:58 AM

Sorry kel, I disagreee.

1) The basic HITS algorithm is available. It's based on locating the adjacency matrix of a base set of pages.
If you google for HITS algorithm, there are several pages with detailed explanations of the code.

Obvioulsy, it's not exactly what Google does, but it is the same basic algorithm.

2) I don't think that the relational schema is an issue, since HITS and other algorithms are based on well defined data. The relational nature is actually an asssitance, since you can do a lot of the work in-situ.

tuckdiddy, you should try one of these books:
Modern Information Retrieval
Ricardo Baeza-Yates, Berthier Ribeiro-Neto, Addison Wesley, 1999.

Data Mining: Concepts and Techniques
Jiawei Han and Micheline Kamber, Morgan Kaufmann, 2000.

They have useful info on building search engines.

Dilbert1234567 · 04-10-2004, 01:38 PM

i belive that google has cache of the web (or of alot of the websites)

i think that is how they do it.

kel · 04-10-2004, 08:02 PM

The hits algorithm itself is only a high level description of what goes on, there is alot more to actually implementing a search algorithm that can accurately span the content of the web.

Standard (meaning off the shelf) relational databases won't work because they don't store and index properly. They won't work as a low level store because they can't find and read information fast enough, so you can't build an engine on top of it that spans the WWW.

You COULD build your own relational database that has the low level design that can access the large amounts of information in less then a second and interact it with it through an SQL query engine. But it would be somewhat pointless.

SQL is a sledgehammer when it comes to the relatively repetitive accesses google has to perform to complete a search.

Fallon · 04-10-2004, 08:43 PM

There's a debian package you can get in apt that's a search engine.
just a fyi

manalone · 04-11-2004, 06:08 PM

Quote:

Originally posted by kel
The hits algorithm itself is only a high level description of what goes on, there is alot more to actually implementing a search algorithm that can accurately span the content of the web.

Really? I found it quite usable. If you google for it, you can get all the maths.
It's not very complex if you have a good matrix library. You just need to be able to repeatedly get the Transpose of the set adjacency matrix and raise it to a power (usually ~10).

Remember, the base set is usually small (only 500 or so pages), plus their links (2,000 pages in total), so you're working on a 2,000 x 2,000 row matrix, and its transpose.

Quote:

Standard (meaning off the shelf) relational databases won't work because they don't store and index properly. They won't work as a low level store because they can't find and read information fast enough, so you can't build an engine on top of it that spans the WWW.

If you're going for billions of pages, sure. But for a DIY system, I would recommend sql, you avoid needing to calculate an inverse matrix, for one, and it's likely to be fast enough up to ~100,000 pages or so.

Quote:

You COULD build your own relational database that has the low level design that can access the large amounts of information in less then a second and interact it with it through an SQL query engine. But it would be somewhat pointless.

Well, you might wish to consider the file format the Google people outline. I am sure the details are available on google.

Quote:

SQL is a sledgehammer when it comes to the relatively repetitive accesses google has to perform to complete a search.

purely relational databases have many advantages, particularly if you decide not to store page caches, or just use a pointer to a filename in a row for the cache.

I think you make a good point from the perspective of a full-scale google implementation, but I think that it is still practical to put together google-like systems, without 6,000-odd pcs and a team of full-time replacers

tritium · 04-14-2004, 03:03 PM

I think Google *is* the Internet....

The g-Matrix has you. :-P

Yakk · 04-14-2004, 03:20 PM

You can buy a google-box from google for intranet use. =)

Btw, manabone, based off the initial post:

Quote:

Another question is, would your standard database server (Oracle, MySQL, SQL Server) be able to do text searches through millions of records? If not, what kind of technology would be used here?

He's talking about a million-page universe, not a teeny tiny intraweb I think. Which is kel's point. =p

Quote:

He looks like run of the mill Hydrogen, but he's isn't, typed:
I think Google *is* the Internet....

Naw, they aren't the internet, they just have a copy of most of it in RAM. =)

manalone · 04-15-2004, 03:47 PM

Quote:

Originally posted by Yakk
Btw, manabone, based off the initial post:

He's talking about a million-page universe, not a teeny tiny intraweb I think. Which is kel's point. =p

A million pages is not very many pages. As I said before, HITS makes use of far less in each ranking query.

And MySql can store gigabyte cells.

and it's pronounced 'manalone'

nothingx · 04-15-2004, 07:44 PM

Quote:

tuckdiddy, you should try one of these books:
Modern Information Retrieval
Ricardo Baeza-Yates, Berthier Ribeiro-Neto, Addison Wesley, 1999.

Man, someone beat me to Baeza-Yates already... this book is the Bible of IR. My senior year in college I took a class in agent based inforamtion retrieval. It was a really cool concept. Instead of having one system that indexes a corpus and running the IR engine, you can have a distributed collection of agents that does each a little on its own to make better results. There's all kinds of complexity in recombining weighted results and determining who does what. Actually, if you're interested, you should check out SIRE http://www.csee.umbc.edu/~cost/sire/... Scott's self-contained IR engine in a box.

04-08-2004, 08:13 PM	#1 (permalink)
tuckdiddy Upright	How to build a search engine? Does anyone know the basics of how a general search engine works? I'm curious as to whether a SQL database could be used for such a thing. For instance, say a simple search engine matches the keywords you type in using boolean AND (it matches all of the keywords). That part is simple enough to do in SQL, but the ranking of the results is where I get stumped. Google has a complex algorithm that ranks a page based on it's popularity: it ranks higher if more pages link to that site. What if I just wanted to make a somewhat search engine using a SQL database that would rank the results based on the number of times a keyword appears in the searched text. Does any have an idea how this would be accomplished? It seems difficult to find general theory about search engines on the internet. That might be because if someone comes up with a good idea, they patent the technology? Another question is, would your standard database server (Oracle, MySQL, SQL Server) be able to do text searches through millions of records? If not, what kind of technology would be used here?

04-08-2004, 08:33 PM	#2 (permalink)
kel WARNING: FLAMMABLE Location: Ask Acetylene	No SQL can't be used as the basis for a search engine that spans the web. The details of googles design isn't made public, you can learn vague bits of information from papers published over the past few years, but not enough to implement. SQL is a relational database. It is not intended for indexing diverse spans of information like the WWW. In any relational database the schema must be defined in advance. The tradeoff is that to make a schema flexible enough for a web search, it would be too expensive to perform a search. Factoid, a google search references 100+ megabytes of data, the smallest block size in the google filesystem is 64 megabytes. Google runs on around 15,000 commodity class PCs spread out in clusters (200+ per cluster) around the world. In every google cluster every piece of information is mirrored no less then three times. Early search engines relied on word frequency and keyword and meta data inserted by the authors of web pages. Authors trying to drive traffic to their sites would insert utter crap to boost their rankings. Google is advanced in that it accounts for a pages rank based on the traffic visiting it, AND based on all the sites that link to it. A site that is heavily linked to will have a higher rank in general. This technique has proven resistant to tampering with by website authors. The exact implementation isn't publicly available and I haven't read anything on the subject in any of the journals I read. You can find an uninformative high level over view at http://www.cs.ubc.ca/~krasic/cse585/brin98anatomy.pdf http://www.google.com/technology/pigeonrank.html __________________ "It better be funny" Last edited by kel; 04-08-2004 at 08:36 PM..

04-10-2004, 01:58 AM	#3 (permalink)
manalone Once upon a time...	Sorry kel, I disagreee. 1) The basic HITS algorithm is available. It's based on locating the adjacency matrix of a base set of pages. If you google for HITS algorithm, there are several pages with detailed explanations of the code. Obvioulsy, it's not exactly what Google does, but it is the same basic algorithm. 2) I don't think that the relational schema is an issue, since HITS and other algorithms are based on well defined data. The relational nature is actually an asssitance, since you can do a lot of the work in-situ. tuckdiddy, you should try one of these books: Modern Information Retrieval Ricardo Baeza-Yates, Berthier Ribeiro-Neto, Addison Wesley, 1999. Data Mining: Concepts and Techniques Jiawei Han and Micheline Kamber, Morgan Kaufmann, 2000. They have useful info on building search engines. __________________ -- Man Alone ======= Abstainer: a weak person who yields to the temptation of denying himself a pleasure. Ambrose Bierce, The Devil's Dictionary.

04-10-2004, 01:38 PM	#4 (permalink)
Dilbert1234567 Devils Cabana Boy Location: Central Coast CA	i belive that google has cache of the web (or of alot of the websites) i think that is how they do it. __________________ Donate Blood! "Love is not finding the perfect person, but learning to see an imperfect person perfectly." -Sam Keen

04-10-2004, 08:02 PM	#5 (permalink)
kel WARNING: FLAMMABLE Location: Ask Acetylene	The hits algorithm itself is only a high level description of what goes on, there is alot more to actually implementing a search algorithm that can accurately span the content of the web. Standard (meaning off the shelf) relational databases won't work because they don't store and index properly. They won't work as a low level store because they can't find and read information fast enough, so you can't build an engine on top of it that spans the WWW. You COULD build your own relational database that has the low level design that can access the large amounts of information in less then a second and interact it with it through an SQL query engine. But it would be somewhat pointless. SQL is a sledgehammer when it comes to the relatively repetitive accesses google has to perform to complete a search. __________________ "It better be funny" Last edited by kel; 04-10-2004 at 08:18 PM..

04-10-2004, 08:43 PM	#6 (permalink)
Fallon Junkie Location: RI	There's a debian package you can get in apt that's a search engine. just a fyi

04-14-2004, 03:03 PM	#8 (permalink)
tritium Professor of Drinkology	I think Google is the Internet.... The g-Matrix has you. :-P __________________ Blah.