11-01-2004, 11:02 PM | #1 (permalink) |
Junkie
[HTML and Perl?] Best choice for screen scraping?
Hey all, I have an idea for screen scraping a site and am wondering what language I should use to quickly and painlessly get my data. It's been a while since I coded, and that was in Perl, which seemed to work at the time.
I'm learning Processing these days but don't think it's got what I need for this particular task. I'm thinking that Perl will do the trick, but is there another language that's come out in the past 4 years that's better suited? Thanks.
11-02-2004, 07:44 AM | #3 (permalink) |
Crazy
Location: UK
Perl's built-in regular expression support and its ability to manipulate text make it an ideal language for screen scraping. You could also write your screen scraping code in an XML-based language such as XSLT, or in JavaScript, but if Perl is what you know then it would be the perfect candidate.
theFez, have a quick look at this: Screen Scraping Definition
__________________
and so ends the thought process for another day... |
11-02-2004, 03:06 PM | #4 (permalink) |
Crazy
Location: Salt Town, UT
Perl is probably the way to go. I did my last few screen scrapes in PHP, but the bulk of the code is all preg_replace (which uses PCRE, the Perl-compatible regular expression library).
Don't try to parse too much, because then even tiny changes to unrelated parts of the site will break your entire scrape. Just go for what you need. I used an array of possible expressions to find what I needed and tried them in order of how confident I was that each was the right one, because different pages sometimes have slightly different formatting for no particular reason.
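For anyone who wants to see the "array of expressions, most trusted first" idea spelled out, here is a minimal sketch in Python. The patterns, the price field, and the names `PRICE_PATTERNS` and `extract_first` are made up for illustration; a real site would need its own expressions.

```python
import re

# Hypothetical patterns, ordered from most to least trusted.
# None of these come from a real site; swap in your own.
PRICE_PATTERNS = [
    re.compile(r'<span class="price">\s*\$([\d.]+)\s*</span>'),  # exact markup
    re.compile(r'Price:\s*\$([\d.]+)'),                          # labelled text
    re.compile(r'\$([\d.]+)'),                                   # last resort
]


def extract_first(html, patterns):
    """Return group 1 of the first pattern that matches, or None."""
    for pattern in patterns:
        match = pattern.search(html)
        if match:
            return match.group(1)
    return None


if __name__ == "__main__":
    page = '<td><span class="price"> $19.99 </span></td>'
    print(extract_first(page, PRICE_PATTERNS))
```

Each pattern grabs only the one value that is needed, so a layout change elsewhere on the page should not break the scrape. When a page uses a slightly different format, the loop just falls through to the next expression.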
11-03-2004, 01:52 AM | #6 (permalink) |
Crazy
Location: UK
HTML Parser
You might find this useful in your scraping efforts: htmlparser.sourceforge.net
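To show what a parser gives you over plain regexes, here is a rough sketch using Python's standard-library `html.parser`. Note that this is a stand-in and not the library linked above, and the names `LinkCollector` and `collect_links` are made up for the example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects (href, text) pairs for every <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def collect_links(html):
    """Parse an HTML string and return the list of (href, text) pairs."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


if __name__ == "__main__":
    sample = '<p><a href="/a">First</a> and <a href="/b"><b>Second</b> link</a></p>'
    print(collect_links(sample))
```

The parser copes with nested tags inside the link, which is where hand-written regexes tend to get brittle.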
__________________
and so ends the thought process for another day... |
11-03-2004, 10:24 AM | #8 (permalink) | |
Junkie
Location: Florida
Quote:
PHP is just about as powerful since it uses a Perl-compatible regex engine (PCRE), but it's annoyingly verbose when you're doing a bunch of regex stuff. Example:
php: $input = preg_replace("/cat/", "dog", $input);
perl: $input =~ s/cat/dog/g;
11-05-2004, 10:06 AM | #9 (permalink) |
Once upon a time...
You could use XSLT transforms to convert the page from a base schema to both XHTML and whatever else you want...
__________________
-- Man Alone ======= Abstainer: a weak person who yields to the temptation of denying himself a pleasure. Ambrose Bierce, The Devil's Dictionary. |