05-04-2007, 05:13 PM | #1 (permalink) |
Poo-tee-weet?
Location: The Woodlands, TX
|
[Perl]Text file manipulation help
I'm doing a data mining project and am trying to get the data processed into a more usable format. I'm using some AOL search data that was released a few months ago. My professor recommended I take all the queries for one user ID, group them into one field, and delete all data except the user IDs and queries.
My prof whipped up a quick Perl script that works, but it includes the empty queries, the ones that show a "-". I would like to remove the queries that are dashes, and remove the users whose queries are all dashes. So basically it needs to delete the rows that have a "-" in Query. It is alright if I use another script to process the original data, deleting those rows, and then run the script my prof gave me to group it all together. If you look at the data I have put down below you can see user 217 has several dashes. If it's not clear, lemme know and I'll try to explain better. This is the script my professor wrote. Code:
#!/usr/bin/perl
$id = 0;
$i = 0;
$j = 0;
$q = "";
open (IN, "test");
open (OUT, ">testresult");
while (<IN>){
    chomp;
    @a = split /\t/;
    if ($id == $a[0]){
        $q = $q . "; $a[1]";
        $i++;
    }
    else{
        $j++;
        print OUT "$id\t$q\t$i\n";
        $id = $a[0];
        $q = $a[1];
        $i = 1;
    }
}
close(IN);
close(OUT);
This is a small part of the dataset the columns go Code:
AnonID Query QueryTime ItemRank ClickURL
Code:
142 rentdirect.com 2006-03-01 07:17:12
142 www.prescriptionfortime.com 2006-03-12 12:31:06
142 staple.com 2006-03-17 21:19:29
142 staple.com 2006-03-17 21:19:45
142 www.newyorklawyersite.com 2006-03-18 08:02:58
142 www.newyorklawyersite.com 2006-03-18 08:03:09
142 westchester.gov 2006-03-20 03:55:57 1 http://www.westchestergov.com
142 space.comhttp 2006-03-24 20:51:24
142 dfdf 2006-03-24 22:23:07
142 dfdf 2006-03-24 22:23:14
142 vaniqa.comh 2006-03-25 23:27:12
142 www.collegeucla.edu 2006-04-03 21:12:14
142 www.elaorg 2006-04-03 21:25:20
142 207 ad2d 530 2006-04-08 01:31:04
142 207 ad2d 530 2006-04-08 01:31:14 1 http://www.courts.state.ny.us
142 broadway.vera.org 2006-04-08 08:38:23
142 broadway.vera.org 2006-04-08 08:38:31
142 vera.org 2006-04-08 08:38:42 1 http://www.vera.org
142 broadway.vera.org 2006-04-08 08:39:30
142 frankmellace.com 2006-04-09 02:19:24
142 ucs.ljx.com 2006-04-09 02:20:44
142 attornyleslie.com 2006-04-13 00:25:27
142 merit release appearance 2006-04-22 23:51:18
142 www.bonsai.wbff.org 2006-05-06 08:49:34
142 loislaw.com 2006-05-12 22:43:36
142 rapny.com 2006-05-18 09:21:57
142 whitepages.com 2006-05-19 19:36:31
217 lottery 2006-03-01 11:58:51 1 http://www.calottery.com
217 lottery 2006-03-01 11:58:51 1 http://www.calottery.com
217 ameriprise.com 2006-03-01 14:06:23 1 http://www.ameriprise.com
217 susheme 2006-03-02 12:31:08
217 united.com 2006-03-03 14:54:13
217 mizuno.com 2006-03-07 22:41:17 1 http://www.mizuno.com
217 p; .; p;' p; ' ;' ;'; 2006-03-09 12:09:27
217 p; .; p;' p; ' ;' ;'; 2006-03-09 12:09:35
217 asiansexygoddess.com 2006-03-16 14:31:36 1 http://www.asiansexygoddess.com
217 buddylis 2006-03-16 15:23:33
217 bestasiancompany.com 2006-03-20 15:15:43 1 http://www.bestasiancompany.com
217 lottery 2006-03-27 14:10:38 1 http://www.calottery.com
217 lottery 2006-03-27 16:34:59 1 http://www.calottery.com
217 ask.com 2006-03-31 14:31:10 1 http://www.ask.com
217 weather.com 2006-03-31 18:00:56
217 wellsfargo.com 2006-04-03 16:57:54
217 www.tabiecummings.com 2006-05-04 17:45:57
217 wanttickets.com 2006-05-16 15:44:38
217 yahoo.com 2006-05-16 16:35:31
217 - 2006-05-18 18:20:10 1 http://www.theonering.net
217 www.ngo-quen.org 2006-05-22 15:49:47
217 - 2006-05-22 16:48:42
217 vietnam 2006-05-22 17:43:42
217 vietnam 2006-05-22 17:43:42
217 vietnam 2006-05-22 17:43:44
217 vietnam 2006-05-22 18:03:24
217 vietnam 2006-05-22 18:03:24
217 vietnam 2006-05-22 18:03:27
217 - 2006-05-23 15:41:48
993 myspace.co 2006-03-01 12:13:36
993 myspace.com 2006-03-01 12:13:41
993 googl 2006-03-01 15:03:25
993 chasebadkids.net 2006-03-03 16:55:48 1 http://www.chasebadkids.net
1268 ozark horse blankets 2006-03-01 17:39:28 8 http://www.blanketsnmore.com
1268 www.ghostrockranch.com 2006-03-04 13:58:23
1268 openrangeht.zachsairforce.com 2006-03-09 22:38:45
1268 sstack.com 2006-03-11 00:17:09
1268 www.mecab.org 2006-03-12 18:59:26
1268 www.raindanceexpress.com 2006-03-18 20:13:01
1268 www.victoriacostumiere.com 2006-03-19 00:26:51
1268 osteen-schaztberg.com 2006-03-21 17:55:25
1268 osteen-schatzberg.com 2006-03-21 17:55:42 1 http://www.osteen-schatzberg.com
1268 osteen-schatzberg.com 2006-03-21 17:55:42 2 http://www.osteen-schatzberg.com
1268 www.buckmountianestates.com 2006-03-24 18:53:10
1268 idx.techsolsc.com 2006-05-07 00:58:21
1268 www.bridleandbit.com 2006-05-09 21:34:23
1268 gall stones 2006-05-11 02:12:51
1268 gallstones 2006-05-11 02:13:02 1 http://www.niddk.nih.gov
1268 http www.flickr.com photos 88145967 n00 24368586 in pool-32148876 n00 2006-05-12 00:09:54
1268 http www.flickr.com photos 88145967 n00 24368586 in pool-32148876 n00 2006-05-12 00:10:26
1268 href a href alt a http www.flickr.com photos 88145967 n00 24368586 in pool-32148876 n00 2006-05-12 01:28:30
1268 http www.flickr.com photos 88145967 n00 24368586 in pool-32148876 n00 2006-05-12 10:41:27
1268 www.acevedoarabians.com 2006-05-21 21:28:21
1268 adbuyer3.lycos.com 2006-05-31 14:10:52
1268 www.pinerplantation.com 2006-05-31 21:24:08
1268 www.pinerplantation.com 2006-05-31 21:24:30
1268 www.pinerplantation.com 2006-05-31 21:24:56
1326 files 2006-03-01 17:36:08
1326 www.kmcwheel.com 2006-03-06 17:31:55
1326 dellcomputers 2006-03-06 20:09:58
1326 www.ameicaneaglewheel.com 2006-03-09 19:09:52
1326 cascadefamilymedical 2006-03-14 11:36:57
1326 cascadefamilymedical.com 2006-03-14 11:39:49
1326 milaniwheel.com 2006-03-14 12:37:30
1326 www.ameicaneaglewheel.com 2006-03-14 18:53:20
1326 www.ameicaneaglewheel.com 2006-03-15 12:27:48
1326 pop up adds 2006-03-15 20:07:38
1326 pop up adds 2006-03-15 20:08:29
1326 the childs wonderland company 2006-03-21 11:50:10
1326 the child's wonderland company 2006-03-21 11:59:03 6 http://www.wonderlandtheatre.com
1326 the child's wonderland company 2006-03-21 12:00:55
1326 the child's wonderland company grand rapids michigan 2006-03-21 12:01:24
1326 the child's wonderland company grand rapids michigan 2006-03-21 12:01:59
1326 the childs wonderland co. 2006-03-21 21:20:42
1326 the child's wonderland co. 2006-03-21 21:22:16
1326 www.ameicaneaglewheel.com 2006-03-22 12:23:07
1326 www.budget rentals.com 2006-03-24 18:26:10
1326 budget truck rental 2006-03-24 18:27:07
1326 adr wheels 2006-03-28 12:53:39
1326 adr wheels 2006-03-28 12:57:04
1326 holiday mansion houseboat 2006-03-29 17:14:01 5 http://www.everyboat.com
1326 back to the future 2006-04-01 17:59:28 1 http://www.imdb.com
1326 holiday mansion houseboat 2006-04-06 20:20:43 1 http://www.iboats.com
1326 www.ameicaneaglewheel.com 2006-04-10 14:04:49
1326 www.ameicaneaglewheel.com 2006-04-10 14:05:15
1326 the childs wonderland company 2006-04-11 17:25:27
1326 konig wheels 2006-04-18 13:29:52 2 http://www.konigwheels.com
1326 konig wheels 2006-04-18 13:29:52 1 http://www.konigwheels.com
1326 jet blue airlines 2006-04-27 15:29:05
1326 coats tire equipment 2006-04-28 15:53:18
1326 coats tire equipment 2006-05-03 19:15:01
1326 verizon wireless 2006-05-09 00:09:22
1326 www.crazyradiodeals.com 2006-05-23 18:00:30
1337 uslandrecords.com 2006-03-01 11:50:34 1 http://www.seda-cog.org
1337 titlesourcein.com 2006-03-14 15:45:07
1337 titlesourceinc 2006-03-14 15:45:55 1 http://www.titlesourceinc.com
1337 select business services 2006-03-14 15:51:41
1337 select business services title 2006-03-14 15:52:10
1337 cbc companies 2006-03-14 15:52:44 2 http://www.cbc-companies.com
1337 cbc companies 2006-03-14 15:52:44 3 http://www.cbc-companies.com
1337 cbc companies 2006-03-14 15:52:44 4 http://www.mktgservices.com
1337 national real estate settlement services 2006-03-14 15:59:13 1 http://www.realtms.com
1337 national real estate settlement services 2006-03-14 15:59:13 7 http://dmoz.org
1337 pennsylvania real estate settlement services 2006-03-14 16:04:40
1337 pennsylvania real estate settlement services 2006-03-14 16:05:11
1337 sunbury pennsylvania real estate settlement services 2006-03-14 16:05:47
1337 sunbury pennsylvania real estate settlement services 2006-03-14 16:06:28 14 http://pa.optimuslaw.com
1337 atm corporation 2006-03-15 13:46:55 1 http://www.atmprof.com
1337 cheasapeake appraisal and settlement services 2006-03-15 13:50:56 1 http://www.johnkvaluation.com
1337 chesapeake appraisal and settlement services 2006-03-15 13:51:52 10 http://www.citigroup.com
1337 pauslandrecords.com 2006-03-20 09:40:50
1337 pa.uslandrecords.com 2006-03-20 09:41:08 2 http://www.seda-cog.org
1337 first american lenders advantage 2006-03-22 16:05:56 1 http://www.firstam.com
1337 first american chesapeake 2006-03-22 16:11:31
1337 first american chesapeake title services 2006-03-22 16:11:50 2 http://www.tavma.com
1337 www.national-reis.com 2006-03-22 16:16:56
1337 www.americantitleinc.com 2006-03-22 16:19:23
1337 www.aculinkms.com 2006-03-22 16:19:31
1337 united one resources 2006-03-22 17:47:11 1 http://www.unitedoneresources.com
1337 credit plus solutions group 2006-03-22 17:52:53
1337 credit plus solutions group 2006-03-22 17:54:09 1 http://www.cpsg.com
1337 security search and abstract 2006-03-22 17:56:19 1 http://www.securitysearchabstract.com
1337 searchtec 2006-03-22 17:58:46 1 http://www.searchtec.com
1337 searchtec 2006-03-22 17:58:46 1 http://www.searchtec.com
1337 fiserv 2006-03-24 14:05:01 1 http://www.fiserv.com
1337 fiserv 2006-03-24 14:05:01 3 http://www.fiservlendingsolutions.com
1337 fiserv 2006-03-24 14:05:01 2 http://www.fiservinsurance.com
1337 fiserv 2006-03-24 14:05:01 3 http://www.fiservlendingsolutions.com
1337 integrated real estate 2006-03-27 14:52:29 1 http://www.integratedreal.com
1337 integrated real estate 2006-03-27 14:52:29 2 http://www.irisnet.net
this is the output with dashes Code:
0 0
142 rentdirect.com; www.prescriptionfortime.com; staple.com; staple.com; www.newyorklawyersite.com; www.newyorklawyersite.com; westchester.gov; space.comhttp; dfdf; dfdf; vaniqa.comh; www.collegeucla.edu; www.elaorg; 207 ad2d 530; 207 ad2d 530; broadway.vera.org; broadway.vera.org; vera.org; broadway.vera.org; frankmellace.com; ucs.ljx.com; attornyleslie.com; merit release appearance; www.bonsai.wbff.org; loislaw.com; rapny.com; whitepages.com 27
217 lottery; lottery; ameriprise.com; susheme; united.com; mizuno.com; p; .; p;' p; ' ;' ;';; p; .; p;' p; ' ;' ;';; asiansexygoddess.com; buddylis; bestasiancompany.com; lottery; lottery; ask.com; weather.com; wellsfargo.com; www.tabiecummings.com; wanttickets.com; yahoo.com; -; www.ngo-quen.org; -; vietnam; vietnam; vietnam; vietnam; vietnam; vietnam; - 29
993 myspace.co; myspace.com; googl; chasebadkids.net 4
1268 ozark horse blankets; www.ghostrockranch.com; openrangeht.zachsairforce.com; sstack.com; www.mecab.org; www.raindanceexpress.com; www.victoriacostumiere.com; osteen-schaztberg.com; osteen-schatzberg.com; osteen-schatzberg.com; www.buckmountianestates.com; idx.techsolsc.com; www.bridleandbit.com; gall stones; gallstones; http www.flickr.com photos 88145967 n00 24368586 in pool-32148876 n00; http www.flickr.com photos 88145967 n00 24368586 in pool-32148876 n00; href a href alt a http www.flickr.com photos 88145967 n00 24368586 in pool-32148876 n00; http www.flickr.com photos 88145967 n00 24368586 in pool-32148876 n00; www.acevedoarabians.com; adbuyer3.lycos.com; www.pinerplantation.com; www.pinerplantation.com; www.pinerplantation.com 24
1326 files; www.kmcwheel.com; dellcomputers; www.ameicaneaglewheel.com; cascadefamilymedical; cascadefamilymedical.com; milaniwheel.com; www.ameicaneaglewheel.com; www.ameicaneaglewheel.com; pop up adds; pop up adds; the childs wonderland company; the child's wonderland company; the child's wonderland company; the child's wonderland company grand rapids michigan; the child's wonderland company grand rapids michigan; the childs wonderland co.; the child's wonderland co.; www.ameicaneaglewheel.com; www.budget rentals.com; budget truck rental; adr wheels; adr wheels; holiday mansion houseboat; back to the future; holiday mansion houseboat; www.ameicaneaglewheel.com; www.ameicaneaglewheel.com; the childs wonderland company; konig wheels; konig wheels; jet blue airlines; coats tire equipment; coats tire equipment; verizon wireless; www.crazyradiodeals.com 36
__________________
-=JStrider=- ~Clatto Verata Nicto |
05-05-2007, 10:59 AM | #2 (permalink) |
Psycho
|
are you a CS major? do you understand what your prof wrote? you should understand it and try to solve it yourself, otherwise the later Perl scripts your prof tells you to write will only get harder..
what your prof wrote assumes the same user ID only occurs in consecutive blocks. if it occurs in different blocks, then each user ID will be printed out on different lines. here is my solution, haven't tested it but it should work..
if ($id == $a[0]){
    if ($a[1] != '\-') { # skip the dashes
        $q = $q . "; $a[1]";
        $i++;
    }
}
else{
    $j++;
    if ($i != 0) { # does not print if 0 occurrences of id (i.e. all dashes)
        print OUT "$id\t$q\t$i\n";
    }
    $id = $a[0];
    if ($a[1] == '\-') {
        $q = "";
        $i = 0;
    }
    else {
        $q = $a[1];
        $i = 1;
    }
}
|
05-05-2007, 12:45 PM | #3 (permalink) |
Darth Papa
Location: Yonder
|
There are all sorts of tests you might add to the condition that makes us save the query. As match000 points out, if the user's records are intermingled, this script doesn't quite solve the problem.
Personally, I'd use a hash to store up all the queries for each userid, and dump them all at once. Assuming the data set is small enough to fit in available memory, of course... |
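A minimal sketch of the hash approach suggested above, assuming the whole input fits in memory and that the first two tab-separated fields are the user ID and the query. The sub name group_queries and the sample rows are made up for illustration; a real run would feed it lines read from the data file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Collect every query under its user ID in a hash of arrays, then return
# one "id<TAB>q1; q2<TAB>count" line per user, in first-seen ID order.
sub group_queries {
    my @lines = @_;
    my (%queries, @order);
    for my $line (@lines) {
        chomp $line;
        my ($id, $query) = split /\t/, $line;
        next unless defined $query;    # skip malformed rows
        next if $query eq '-';         # skip empty ("-") queries
        push @order, $id unless exists $queries{$id};
        push @{ $queries{$id} }, $query;
    }
    return map {
        join("\t", $_, join('; ', @{ $queries{$_} }), scalar @{ $queries{$_} })
    } @order;
}

# Hypothetical sample rows in the same layout as the dataset.
my @sample = (
    "217\tyahoo.com\t2006-05-16 16:35:31",
    "993\tgoogl\t2006-03-01 15:03:25",
    "217\t-\t2006-05-22 16:48:42",
    "217\tvietnam\t2006-05-22 17:43:42",
);
print "$_\n" for group_queries(@sample);
```

Because queries are keyed by ID, this works even when a user's rows are not consecutive, and a user whose queries are all dashes never gets an entry at all. The cost is that everything is held in memory until the end.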
05-05-2007, 01:23 PM | #4 (permalink) |
Poo-tee-weet?
Location: The Woodlands, TX
|
I'm not a CS major; this is for a data mining class, and this is just part of the data preprocessing before we can start analyzing the data, so it's not critical what method we use to process the data, just that we do process it.
And the data is sorted by user ID, and then date, so it should be fine to assume all the user IDs are in consecutive blocks. As far as loading everything up into a hash, the dataset is 2.1GB, so that prolly wouldn't work. My partner just figured one out that seems to work on a small part of the dataset. Code:
#!/usr/bin/perl
$id = 0;
#$i = 0;
#$j = 0;
$q = "";
open (IN, "Test");
open (OUT, ">TestResult");
while (<IN>){
    chomp; #removes all newline characters
    @a = split /\t/; #the whole dataset, not just a line
    if ($id == $a[0]){
        if (!($a[1] =~ m/[\-]$/)){ #regex FTW!
            $q = $q . "; $a[1]";
        }
        #$i++;
    }
    else{
        #$j++;
        print OUT "$id\t$q\t\n"; #modify this print statement to display the number of terms this user
        #searched for
        $id = $a[0];
        $q = $a[1];
        #$i = 1;
    }
}
close(IN);
close(OUT);
but for some reason I'm getting "bash: ./group_ID.pl: /usr/bin/perl: bad interpreter: Permission denied" when I try to run it... just reinstalled ubuntu this morning... must have something to do with that...
__________________
-=JStrider=- ~Clatto Verata Nicto |
05-05-2007, 01:35 PM | #5 (permalink) |
Psycho
|
your partner's solution does not take care of the case where the userID has a single occurrence AND has a query of '-'.
my solution offers one way of taking care of this.. also, for my solution, move $j++ inside the if block that has the print statement. i am assuming $j counts the number of userID blocks printed.. |
05-05-2007, 02:09 PM | #6 (permalink) |
Poo-tee-weet?
Location: The Woodlands, TX
|
match, when I run your script on the small dataset I just get the output
"142 ; 207 ad2d 530; 207 ad2d 530 2"
Thanks for all the help, it's really appreciated. Do you have any good online tutorials for Perl that I can check out? I can follow the logic well enough, but the syntax just seems weird to me. Looks like my partner's script has problems with some locations where there is only one row for a user ID, and it doesn't get all the dashes...
__________________
-=JStrider=- ~Clatto Verata Nicto Last edited by JStrider; 05-05-2007 at 02:21 PM.. Reason: Automerged Doublepost |
05-05-2007, 03:20 PM | #7 (permalink) |
Psycho
|
dunno, could be a logical error, or most likely it's the way I do the character (dash '-') comparison..
$a[1] == '\-' might not be correct. i am assuming - is a special character so you have to use '\-', but maybe you just use '-' if it's not a special character. also, i don't know if in Perl you can do == '\-'; you might have to use a regexp like your partner did.. something like: $a[1] =~ /\s*\-\s*/ (not sure if this is correct, but you get the picture). hope this helps. for tutorials, google: perl tutorial. the first hit is very outdated but I read it anyway (in Jan) to learn Perl.. also google: stanford perl tutorial, i think that gives a good one.. |
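For what it's worth, the suspicion about == looks right: in Perl == and != compare numbers, while eq and ne compare strings, and a dash needs no backslash inside single quotes. A small sketch of the difference follows, using queries from the sample data. The sub names are made up for illustration. Note also that an unanchored pattern like /\s*\-\s*/ matches any query that merely contains a dash.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Numeric comparison: non-numeric strings count as 0, so almost every
# query "equals" '\-'. Only queries starting with a nonzero number differ.
sub numeric_equals_dash {
    my ($query) = @_;
    no warnings 'numeric';    # silence "isn't numeric" for this demo
    return $query == '\-' ? 1 : 0;
}

# Unanchored regex: true for anything containing a dash.
sub contains_dash {
    my ($query) = @_;
    return $query =~ /\s*\-\s*/ ? 1 : 0;
}

# Anchored regex: true only when the field is a lone dash,
# optionally padded with whitespace.
sub is_dash {
    my ($query) = @_;
    return $query =~ /^\s*-\s*$/ ? 1 : 0;
}

for my $q ('-', 'rentdirect.com', '207 ad2d 530', 'osteen-schatzberg.com') {
    printf "%-22s numeric==:%d  contains:%d  is_dash:%d\n",
        $q, numeric_equals_dash($q), contains_dash($q), is_dash($q);
}
```

This would also account for the odd output reported above: with the numeric test, only "207 ad2d 530" compares as different from '\-', so it is the only query that gets appended.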
05-05-2007, 05:31 PM | #8 (permalink) |
Junkie
Location: San Antonio, TX
|
Well. First things first.
Add:
use warnings;
use strict;
right below the #!/usr/bin/perl line in your script. Do this for all your Perl scripts, great or small. Tell your professor I told him to do this as well. Also, I noticed that $j is never used. Weird, but whatever. Anyway, the test I decided to use was /\w\w/, i.e. 'does the supposed URL contain 2 consecutive 'word' characters (letters, digits, or underscore)'. This rules out a single dash, and a bunch of other possible bogus data. As an aside, apparently tfproject's software doesn't properly escape '>' and '<', so you need to change them to '&gt;' and '&lt;' respectively when you post code, even inside a 'code' block. This is what caused the 'while' line to not show up correctly. Code:
#!/usr/bin/perl
use warnings;
use strict;
my $id = 0;
my $i = 0;
my $j = 0;
my $q = "";
open (IN, "test");
open (OUT, ">testresult");
while (<IN>){
    chomp;
    my @a = split /\t/;
    if ($id == $a[0]) {
        if ($a[1] =~ /\w\w/) {
            $q = $q . "; $a[1]";
            $i++;
        }
    }
    else {
        $j++;
        print OUT "$id\t$q\t$i\n";
        $id = $a[0];
        $q = $a[1];
        $i = 1;
    }
}
close(IN);
close(OUT);
|
05-05-2007, 07:01 PM | #9 (permalink) |
Poo-tee-weet?
Location: The Woodlands, TX
|
robot parade, I got the scripts running. And what do use warnings/strict do?
Your script is very close. The only problem I see is that when the field with the '-' in it is the first one for that ID, it still includes it. This wouldn't have shown up using the little example dataset I provided. I'm thinking that the way to go may be to have 2 different scripts: one that just deletes the rows with '-' in the query field, and then after running it I run the one that my prof wrote to cluster it. Here's a link to where I downloaded the datasets from originally. http://www.gregsadetsky.com/aol-data/ It's a really interesting dataset just to open up and look at and see what people are searching for.
__________________
-=JStrider=- ~Clatto Verata Nicto |
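The two-script idea could be sketched roughly like this: a predicate that decides whether a row survives, assuming tab-separated fields with the query in the second column. The name keep_row and the sample rows are hypothetical. In a real filter the same test would sit inside a while (<IN>) loop, so the 2.1GB file is never loaded at once.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return 1 if the row should be kept, 0 if its Query field is a lone dash
# (or the row has no Query field at all).
sub keep_row {
    my ($line) = @_;
    chomp $line;                       # works on a copy; caller's line is untouched
    my @fields = split /\t/, $line;
    return (defined $fields[1] && $fields[1] ne '-') ? 1 : 0;
}

# Hypothetical sample rows in the dataset's layout.
my @sample = (
    "217\tyahoo.com\t2006-05-16 16:35:31\n",
    "217\t-\t2006-05-18 18:20:10\t1\thttp://www.theonering.net\n",
    "217\twww.ngo-quen.org\t2006-05-22 15:49:47\n",
);
print grep { keep_row($_) } @sample;
```

Using ne '-' rather than a regex keeps queries that merely contain a dash, such as www.ngo-quen.org. Once the dash rows are gone, a user with only dash queries has no rows left, so the grouping script never sees that ID.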
05-05-2007, 07:12 PM | #10 (permalink) | |
Psycho
|
Quote:
basically you are not catching the case where there is a userID that has only ONE entry, and that one entry is a '-'.. |
|
05-05-2007, 08:53 PM | #11 (permalink) | ||
Darth Papa
Location: Yonder
|
Quote:
Quote:
|
||
05-06-2007, 09:51 AM | #12 (permalink) | ||
Junkie
Location: San Antonio, TX
|
Quote:
Oops - good catch. Duplicating the if() with the regex is probably the right way to go here. -RN
Quote:
I think match000's solution of testing for a '-' in the 'else' clause too is probably the best way to go there. Last edited by robot_parade; 05-06-2007 at 09:55 AM.. Reason: Automerged Doublepost |
||
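Pulling the thread's suggestions together, here is one possible sketch of a single streaming pass. It assumes the input is sorted by user ID, tests every row for a lone dash (including the first row of each user), skips users left with no queries, and prints the last user when the input ends, since the posted loop only prints when the ID changes. The sub name group_stream and the sample rows are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Read rows sorted by user ID from $in and write one
# "id<TAB>q1; q2<TAB>count" line per user to $out.
sub group_stream {
    my ($in, $out) = @_;
    my ($id, @queries);
    while (my $line = <$in>) {
        chomp $line;
        my ($row_id, $query) = split /\t/, $line;
        next unless defined $query;            # skip blank or malformed rows
        if (defined $id && $row_id ne $id) {
            # ID changed: emit the previous user if any queries survived
            print {$out} join("\t", $id, join('; ', @queries), scalar @queries), "\n"
                if @queries;
            @queries = ();
        }
        $id = $row_id;
        push @queries, $query unless $query eq '-';
    }
    # Flush the final user once the input is exhausted.
    print {$out} join("\t", $id, join('; ', @queries), scalar @queries), "\n"
        if defined $id && @queries;
}

# Hypothetical sample, read and written through in-memory filehandles.
my $data = join '', map { "$_\n" } (
    "217\tyahoo.com\t2006-05-16 16:35:31",
    "217\t-\t2006-05-22 16:48:42",
    "217\tvietnam\t2006-05-22 17:43:42",
    "993\t-\t2006-03-01 12:13:36",
    "1268\t-\t2006-05-11 02:12:51",
    "1268\tgallstones\t2006-05-11 02:13:02",
);
open my $in,  '<', \$data      or die "cannot read sample: $!";
open my $out, '>', \my $result or die "cannot open buffer: $!";
group_stream($in, $out);
close $out;
print $result;
```

For the real data, the two in-memory handles would be replaced by handles opened on the input and output files. Only one user's queries are held at a time, so memory use stays small regardless of file size.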