Tilted Forum Project Discussion Community  

Old 11-01-2004, 11:02 PM   #1 (permalink)
Junkie
 
[HTML and Perl?] Best choice for screen scraping?

Hey all, I have an idea for screen scraping a site and am wondering what language I should use to get my data quickly and painlessly. It's been a while since I coded, and that was in Perl, which seemed to work at the time.

I'm learning Processing these days but don't think it has what I need for this particular task.

I'm thinking that Perl will do the trick, but is there another language that's come out in the past 4 years that's better suited?

Thanks.
FngKestrel is offline  
Old 11-02-2004, 06:57 AM   #2 (permalink)
Crazy
 
Location: here and there
wtf is 'screen scraping'?
__________________
# chmod 111 /bin/Laden
theFez is offline  
Old 11-02-2004, 07:44 AM   #3 (permalink)
Crazy
 
Location: UK
Perl's built-in regular expression support and its text manipulation features make it an ideal language for screen scraping. You could also write your screen scraping code in XML or JavaScript, but if Perl is what you know, it's the perfect candidate.

theFez, have a quick look at this: Screen Scraping Definition
__________________
and so ends the thought process for another day...
Stug is offline  
Old 11-02-2004, 03:06 PM   #4 (permalink)
Crazy
 
Location: Salt Town, UT
Perl is probably the way to go. I did my last few screen scrapes in PHP, but the bulk of the code is all preg_replace (which uses PCRE, the Perl-compatible regular expression engine).

Don't try to parse too much, because then even tiny changes to unrelated parts of the site will break your entire scrape. Just go for what you need.

I kept an array of possible expressions for each thing I needed and tried them in order, from the one I was most confident in to the least, because different pages sometimes have slightly different formatting for no particular reason.
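
For anyone who wants to see the "ordered list of expressions" idea spelled out, here is a minimal sketch in Python (the same loop works in Perl or PHP). The patterns, the `extract_first` helper, and the sample pages are all made up for illustration; they are not taken from any real site.

```python
import re

# Hypothetical patterns, ordered from most trusted to least trusted.
PRICE_PATTERNS = [
    re.compile(r'<span class="price">\s*\$([\d.]+)\s*</span>'),
    re.compile(r'Price:\s*\$([\d.]+)'),
    re.compile(r'\$([\d.]+)'),
]


def extract_first(html, patterns):
    """Return group 1 of the first pattern that matches, or None."""
    for pattern in patterns:
        match = pattern.search(html)
        if match:
            return match.group(1)
    return None


# Two made-up pages that format the same value differently.
page_a = '<td><span class="price"> $19.99 </span></td>'
page_b = '<td><b>Price: $5.00</b></td>'

print(extract_first(page_a, PRICE_PATTERNS))
print(extract_first(page_b, PRICE_PATTERNS))
```

Each pattern only looks at the small piece of markup around the value, so layout changes elsewhere on the page shouldn't affect it, and a page that matches nothing returns None instead of garbage.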
Rawb is offline  
Old 11-02-2004, 06:33 PM   #5 (permalink)
Crazy
 
Location: here and there
ah, i see

perl or php should do the trick. anything with good regular expression support.
__________________
# chmod 111 /bin/Laden
theFez is offline  
Old 11-03-2004, 01:52 AM   #6 (permalink)
Crazy
 
Location: UK
HTML Parser

You might find this useful in your scraping efforts: htmlparser.sourceforge.net
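
The linked library is a Java one, so this is not its API. As a rough illustration of what parser-based scraping looks like compared with regexes, here is a small sketch using the `html.parser` module from Python's standard library. The `LinkCollector` class, the `collect_links` helper, and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def collect_links(html):
    """Parse an HTML string and return all link targets in document order."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Made-up markup with inconsistent casing and an extra attribute.
sample = '<p><a href="/news/1">One</a> <A HREF="/news/2" class="x">Two</A></p>'
print(collect_links(sample))
```

The parser copes with casing and extra attributes on its own, which is the sort of variation that tends to break a hand-written regex.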
__________________
and so ends the thought process for another day...
Stug is offline  
Old 11-03-2004, 10:13 AM   #7 (permalink)
Junkie
 
Awesome link! Thanks for the info. Time to dust off my O'Reilly book.
FngKestrel is offline  
Old 11-03-2004, 10:24 AM   #8 (permalink)
Junkie
 
Location: Florida
Quote:
Originally Posted by aoeuhtns
Perl is probably the way to go. I did my last few screen scrapes in PHP, but the bulk of the code is all preg_replace (which uses PCRE, the Perl-compatible regular expression engine).
Same here. Perl has an excellent regex engine. I used it to make a nice little newsfeed app that feeds search terms to news.google.com and parses out the articles.

PHP is just about as powerful since it uses a Perl-compatible engine, but it's annoyingly verbose when you're doing a bunch of regex stuff. Example:

php: $input = preg_replace("/cat/", "dog", $input);
perl: $input =~ s/cat/dog/g;
irseg is offline  
Old 11-05-2004, 10:06 AM   #9 (permalink)
Once upon a time...
 
You could use XSLT transforms to convert the page from a base schema to both XHTML and whatever else you want...
__________________
--
Man Alone
=======
Abstainer: a weak person who yields to the temptation of denying himself a pleasure.
Ambrose Bierce, The Devil's Dictionary.
manalone is offline  
 

All times are GMT -8.