Tilted Forum Project Discussion Community  

Old 11-01-2004, 11:02 PM   #1 (permalink)
Junkie
 
[HTML and Perl?] Best choice for screen scraping?

Hey all, I have an idea for screen scraping a site and am wondering what language I should use to quickly and painlessly get my data. It's been a while since I coded, and that was in Perl, which seemed to work at the time.

I'm learning Processing these days but don't think it's got what I need for this particular task.

I'm thinking that Perl will do the trick, but is there another language that's come out in the past 4 years that's better suited?

Thanks.
FngKestrel is offline  
Old 11-02-2004, 06:57 AM   #2 (permalink)
Crazy
 
Location: here and there
wtf is 'screen scraping'?
__________________
# chmod 111 /bin/Laden
theFez is offline  
Old 11-02-2004, 07:44 AM   #3 (permalink)
Crazy
 
Location: UK
Perl's built-in regular expression functionality and ability to manipulate text would make it the ideal language for screen scraping. You could also write your screen-scraping code in XML or JavaScript, but if Perl is what you know then it would be the perfect candidate.

theFez, have a quick look at this: Screen Scraping Definition
__________________
and so ends the thought process for another day...
Stug is offline  
Old 11-02-2004, 03:06 PM   #4 (permalink)
Crazy
 
Location: Salt Town, UT
Perl is probably the way to go. I did my last few screen scrapes in PHP, but the bulk of the code is all preg_replace (which uses PCRE, the Perl-compatible regular expression engine).

Don't try to parse too much, because then even tiny changes to random stuff on the site will mess up your entire scrape. So just go for what you need.

I used an array of candidate expressions to find what I needed and tried them in order of how confident I was in each one, because different pages sometimes have slightly different formatting for no particular reason.
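
A minimal sketch of that "try the expressions in order of confidence" idea, written in Python rather than PHP or Perl. The patterns, the sample pages, and the name extract_first are all made up for illustration; they are not the code described above.

```python
import re

# Hypothetical patterns for a price field, most trusted first.
PRICE_PATTERNS = [
    re.compile(r'<span class="price">\s*\$([\d.,]+)\s*</span>'),
    re.compile(r'Price:\s*\$([\d.,]+)'),
    re.compile(r'\$([\d.,]+)'),
]

def extract_first(html, patterns):
    """Return group 1 of the first pattern that matches, or None."""
    for pattern in patterns:
        match = pattern.search(html)
        if match:
            return match.group(1)
    return None

# Two made-up pages that format the same field differently.
page_a = '<td><span class="price"> $19.99 </span></td>'
page_b = '<td>Price: $5.00</td>'

print(extract_first(page_a, PRICE_PATTERNS))
print(extract_first(page_b, PRICE_PATTERNS))
```

Each pattern only targets the one field that is needed, so a layout change elsewhere on the page should not break the scrape.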
Rawb is offline  
Old 11-02-2004, 06:33 PM   #5 (permalink)
Crazy
 
Location: here and there
ah, i see

perl or php should do the trick. anything with good regular expression support.
__________________
# chmod 111 /bin/Laden
theFez is offline  
Old 11-03-2004, 01:52 AM   #6 (permalink)
Crazy
 
Location: UK
HTML Parser

You might find this useful in your scraping efforts: htmlparser.sourceforge.net
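
To give a rough idea of what a parser-based scrape looks like compared to regexes, here is a small sketch using Python's standard html.parser module. This is not the library linked above; the LinkCollector class and the sample markup are assumptions for illustration only.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Made-up markup with inconsistent casing and an extra attribute.
sample = '<p><a href="/one">One</a> and <A HREF="/two" class="x">Two</A></p>'

collector = LinkCollector()
collector.feed(sample)
print(collector.links)
```

The parser deals with tag casing and attribute order itself, which is the sort of variation that tends to trip up a hand-written regex.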
__________________
and so ends the thought process for another day...
Stug is offline  
Old 11-03-2004, 10:13 AM   #7 (permalink)
Junkie
 
Awesome link! Thanks for the info. Time to dust off my O'Reilly book.
FngKestrel is offline  
Old 11-03-2004, 10:24 AM   #8 (permalink)
Junkie
 
Location: Florida
Quote:
Originally Posted by aoeuhtns
Perl is probably the way to go. I did my last few screen scrapes in PHP, but the bulk of the code is all preg_replace (which uses PCRE, the Perl-compatible regular expression engine).
Same here. Perl has an excellent regex engine. I used it to make a nice little newsfeed app that feeds search terms to news.google.com and parses out the articles.

PHP is just about as powerful since its preg functions use a Perl-compatible engine (PCRE), but it's annoyingly verbose when you're doing a bunch of regex stuff. Example:

php: $input = preg_replace("/cat/", "dog", $input);
perl: $input =~ s/cat/dog/g;
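
For comparison, the same substitution can be sketched in Python with the standard re module. The sample string is made up; the point is only that re.sub replaces every match by default, like Perl's /g flag.

```python
import re

text = "the cat sat near another cat"

# re.sub replaces every match by default, like Perl's s/cat/dog/g.
result = re.sub(r"cat", "dog", text)

# count=1 limits it to the first match, like Perl's s/cat/dog/ without /g.
first_only = re.sub(r"cat", "dog", text, count=1)

print(result)
print(first_only)
```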
irseg is offline  
Old 11-05-2004, 10:06 AM   #9 (permalink)
Once upon a time...
 
You could use XSLT transforms to convert the page from a base schema to both XHTML and whatever else you want...
__________________
--
Man Alone
=======
Abstainer: a weak person who yields to the temptation of denying himself a pleasure.
Ambrose Bierce, The Devil's Dictionary.
manalone is offline  
 

All times are GMT -8.
