[HTML and Perl?] Best choice for screen scraping?
Hey all, I have an idea for screen scraping a site and am wondering what language I should use to quickly and painlessly get my data. It's been a while since I coded, and that was in Perl, which seemed to work at the time.
I'm learning Processing these days but don't think it's got what I need for this particular task. I'm thinking that Perl will do the trick, but is there another language that's come out in the past 4 years that's better suited? Thanks. |
wtf is 'screen scraping'?
|
Perl's built-in regular expression functionality and ability to manipulate text would make it the ideal language for screen scraping. You could also write your screen scraping code in XML or JavaScript, but if Perl is what you know then it would be the perfect candidate.
theFez, have a quick look at this: Screen Scraping Definition |
Perl is probably the way to go. I did my last few screen scrapes in PHP, but the bulk of the code is all preg_replace (which uses PCRE, a Perl-compatible regular expression engine).
Don't try to parse too much, because then even tiny changes to random stuff on the site will mess up your entire scrape. So just go for what you need. I used an array of possible expressions to find what I needed, and tried them in order of how confident I was that each was the right expression, because sometimes different pages have slightly different formatting for no particular reason. |
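For anyone who wants to see the "array of expressions, most trusted first" idea spelled out, here's a rough sketch. It's in Python rather than Perl or PHP only so it can be run as-is; the same loop works with any regex engine. The patterns, the sample pages, and the names PRICE_PATTERNS and first_match are all made up for illustration and don't come from any real site.

```python
import re

# Hypothetical patterns for pulling a price out of a page, ordered from
# most trusted (very specific) to least trusted (very loose).
PRICE_PATTERNS = [
    re.compile(r'<span class="price">\s*\$([\d.]+)\s*</span>'),
    re.compile(r'Price:\s*\$([\d.]+)'),
    re.compile(r'\$([\d.]+)'),
]


def first_match(html, patterns):
    """Return the captured text from the first pattern that matches, else None."""
    for pattern in patterns:
        match = pattern.search(html)
        if match:
            return match.group(1)
    return None


# Two made-up pages that format the same data differently.
page_a = '<div><span class="price"> $19.99 </span></div>'
page_b = '<td>Price: $5.00</td>'

print(first_match(page_a, PRICE_PATTERNS))
print(first_match(page_b, PRICE_PATTERNS))
```

Each pattern only looks at the small piece of markup around the value, so unrelated changes elsewhere on the page don't break the scrape, and a None result tells you it's time to add or fix a pattern.
|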
ah, i see
perl or php should do the trick. anything with good regular expression support. |
HTML Parser
You might find this useful in your scraping efforts: htmlparser.sourceforge.net
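To give a feel for what a parser buys you over raw regexes, here's a minimal sketch. It does not use the library linked above; it swaps in Python's standard-library html.parser purely as an illustration, and LinkCollector, extract_links, and the sample markup are hypothetical names and data.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag the parser sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; attrs is a list of
        # (name, value) pairs, and value is None for bare attributes.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html):
    """Return all link targets found in the given HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Made-up markup with inconsistent casing and attribute order.
sample = '<p><A HREF="/one">One</A> and <a class="x" href="/two">two</a></p>'
print(extract_links(sample))
```

The parser takes care of tag casing and attribute order, which are exactly the small formatting differences that tend to trip up a hand-written expression.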
|
Awesome link! Thanks for the info. Time to dust off my O'Reilly book. :D
|
PHP is just about as powerful since its preg_* functions use PCRE, a Perl-compatible regex engine, but it's annoyingly verbose when you're doing a bunch of regex stuff. Example: PHP: $input = preg_replace("/cat/", "dog", $input); Perl: $input =~ s/cat/dog/g; |
You could use XSLT transforms to convert the page from a base schema to both XHTML and whatever else you want...
|