[Perl] Text file manipulation help
I'm doing a data mining project and am trying to get the data processed into a more usable format. I'm using some AOL search data that was released a few months ago. My professor recommended I take all the queries for one user ID, group them into one field, and delete all data except the user IDs and queries.
My prof whipped up a quick Perl script that works, but it includes the empty queries, which show up as a "-". I would like to remove the queries that are dashes, and remove the users whose queries are all dashes. So basically it needs to delete the rows that have a "-" in Query. It's all right if I use another script to process the original data, deleting those rows, and then run the script my prof gave me to group it all together. If you look at the data I have put down below you can see user 217 has several dashes. If it's not clear, lemme know and I'll try to explain better. This is the script my professor wrote. Code:
#!/usr/bin/perl
This is a small part of the dataset. The columns go:
Code:
AnonID Query QueryTime ItemRank ClickURL
Code:
142 rentdirect.com 2006-03-01 07:17:12
This is the output with dashes:
Code:
0 0 |
Are you a CS major? Do you understand what your prof wrote? You should understand it and try to solve it yourself, otherwise the later Perl scripts your prof tells you to write will only get harder.
What your prof wrote assumes the same user ID only occurs in consecutive blocks. If it occurs in different blocks, then each block for that user ID will be printed out on a different line. Here is my solution; I haven't tested it, but it should work:
    if ($id == $a[0]) {
        if ($a[1] != '\-') {    # skip the dashes
            $q = $q . "; $a[1]";
            $i++;
        }
    } else {
        $j++;
        if ($i != 0) {          # does not print if 0 occurrences of id (i.e. all dashes)
            print OUT "$id\t$q\t$i\n";
        }
        $id = $a[0];
        if ($a[1] == '\-') {
            $q = "";
            $i = 0;
        } else {
            $q = $a[1];
            $i = 1;
        }
    } |
There are all sorts of tests you might add to the condition that makes us save the query. As match000 points out, if a user's records are intermingled, this script doesn't quite solve the problem.
Personally, I'd use a hash to store up all the queries for each userid, and dump them all at once. Assuming the data set is small enough to fit in available memory, of course... |
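For what it's worth, the hash idea could look something like the sketch below. It is not anyone's script from this thread: the sub name group_by_user, the tab-separated layout, and the sample rows are all assumptions made up for illustration, and it only makes sense if the kept queries fit in memory.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sketch only: assumes tab-separated rows (AnonID, Query, ...) and that all
# kept queries fit in memory. group_by_user is a hypothetical name.
sub group_by_user {
    my @rows = @_;
    my (%queries_for, @seen_order);
    for my $row (@rows) {
        chomp(my $line = $row);
        my ($id, $query) = split /\t/, $line;
        next unless defined $query;        # malformed row
        next if $id eq 'AnonID';           # header row
        next if $query =~ /^\s*-\s*$/;     # empty query shown as "-"
        push @seen_order, $id unless exists $queries_for{$id};
        push @{ $queries_for{$id} }, $query;
    }
    # One output line per user: id, joined queries, query count.
    my @out;
    for my $id (@seen_order) {
        my @q = @{ $queries_for{$id} };
        push @out, join("\t", $id, join('; ', @q), scalar @q);
    }
    return @out;
}

# Made-up rows with the two users intermingled.
print "$_\n" for group_by_user(
    "142\tfirst query\n",
    "217\t-\n",
    "142\tsecond query\n",
    "217\t-\n",
);
```

Because the queries are collected per user ID before anything is printed, intermingled records end up on one line, and a user whose queries are all dashes never gets an entry at all.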
I'm not a CS major; this is for a data mining class, and this is just part of the data preprocessing before we can start analyzing the data, so it's not critical what method we use to process the data, just that we do process it.
And the data is sorted by user ID, and then date, so it should be fine, assuming all the user IDs are in consecutive blocks. As far as loading everything up into a hash, the dataset is 2.1GB so that prolly wouldn't work. My partner just figured one out that seems to work on a small part of the dataset. Code:
#!/usr/bin/perl
But for some reason I'm getting "bash: ./group_ID.pl: /usr/bin/perl: bad interpreter: Permission denied" when I try to run it... just reinstalled Ubuntu this morning... must have something to do with that... |
Your partner's solution does not take care of the case where the user ID has a single occurrence AND has a query of '-'.
My solution offers one way of taking care of this. Also, for my solution, move $j++ inside the if block that has the print statement. I am assuming $j counts the number of user ID blocks printed. |
match, when I run your script on the small dataset I just get the output
"142 ; 207 ad2d 530; 207 ad2d 530 2"
Thanks for all the help, it's really appreciated. Do you have any good online tutorials for Perl that I can check out? I can follow the logic well enough, but the syntax just seems weird to me. It looks like my partner's script has problems in some places where a user ID has only one row, and it doesn't get all the dashes... |
Dunno, could be a logical error, or most likely it's the way I do the comparison against the dash character '-'.
$a[1] == '\-' might not be correct. I am assuming - is a special character so you have to use '\-', but maybe you just use '-' if it's not a special character. Also, I don't know if in Perl you can do == '\-'; you might have to use a regexp like your partner did, something like: $a[1] =~ /\s*\-\s*/ (not sure if this is correct, but you get the picture). Hope this helps.
For tutorials, google "perl tutorial"; the first hit is very outdated but I read it anyway (in Jan) to learn Perl. Also google "stanford perl tutorial"; I think that gives a good one. |
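On the comparison question: in Perl, == and != compare numbers, while eq and ne compare strings, and inside single quotes '\-' is two characters (a backslash and a dash). The sketch below shows the difference; the helper names is_dash, is_dash_re and numeric_equal are made up for illustration. It may also explain why only the query starting with digits made it into the output above, since a string like that has a non-zero numeric value.

```perl
use strict;
use warnings;

# Exact string test for the empty-query marker.
sub is_dash    { my ($q) = @_; return $q eq '-' ? 1 : 0 }

# Anchored regex: a lone dash with optional surrounding whitespace.
sub is_dash_re { my ($q) = @_; return $q =~ /^\s*-\s*$/ ? 1 : 0 }

# What == actually does with two strings: compares their numeric values.
sub numeric_equal {
    my ($x, $y) = @_;
    no warnings 'numeric';   # silence "isn't numeric" for this demo only
    return $x == $y ? 1 : 0;
}

print "length of '\\-': ", length('\-'), "\n";                            # 2
print "rentdirect.com == '\\-': ", numeric_equal('rentdirect.com', '\-'), "\n"; # 1, both are 0 as numbers
print "207 ad2d 530 == '\\-': ", numeric_equal('207 ad2d 530', '\-'), "\n";     # 0, 207 vs 0
print "is_dash('-'): ", is_dash('-'), "\n";                               # 1
print "is_dash('e-mail'): ", is_dash('e-mail'), "\n";                     # 0
# An unanchored pattern matches any string that merely contains a dash.
print "unanchored on 'e-mail': ", ('e-mail' =~ /\s*\-\s*/ ? 1 : 0), "\n"; # 1
print "anchored on 'e-mail': ", is_dash_re('e-mail'), "\n";               # 0
```

So for this job either $a[1] eq '-' or an anchored pattern such as /^\s*-\s*$/ should do; the dash needs no escaping outside a character class.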
Well. First things first.
Add:
use warnings;
use strict;
right below the #!/usr/bin/perl line in your script. Do this for all your Perl scripts, great or small. Tell your professor I told him to do this as well. Also, I noticed that $j is never used. Weird, but whatever. Anyway, the test I decided to use was /\w\w/, i.e. 'does the supposed URL contain 2 consecutive 'word' characters' (letters, digits, or underscore; a dash is not a word character). This rules out a single dash, and a bunch of other possible bogus data. As an aside, apparently tfproject's software doesn't properly escape '>' and '<', so you need to change them to '&gt;' and '&lt;' respectively when you post code, even inside a 'code' block. This is what caused the 'while' line to not show up correctly. Code:
#!/usr/bin/perl |
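A small sketch of the two points from that post, with made-up helper names (has_word_pair, strict_error): the /\w\w/ test applied to a few sample fields, and what use strict does when a variable is used without being declared. use warnings additionally reports things like use of an uninitialized value at run time.

```perl
use strict;
use warnings;

# True if the field contains two consecutive word characters
# (letters, digits, or underscore).
sub has_word_pair { my ($field) = @_; return $field =~ /\w\w/ ? 1 : 0 }

# Compile a snippet that uses an undeclared variable; under strict this is
# a compile-time error, which string eval reports in $@.
sub strict_error {
    my $ok = eval q{ use strict; $undeclared_count = 1; 1 };
    return $ok ? '' : $@;
}

print "rentdirect.com: ", has_word_pair('rentdirect.com'), "\n";  # 1
print "single dash: ",    has_word_pair('-'), "\n";               # 0
print "one letter: ",     has_word_pair('a'), "\n";               # 0
print "strict says: ", strict_error();
```

Note that /\w\w/ also rejects a genuine one-character query, so it is a slightly broader filter than an exact test for '-'.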
robot parade, I got the scripts running. And what do use warnings/strict do?
Your script is very close; the only problem I see is that when the field with the '-' in it is the first one for that ID, it still includes it. This wouldn't have shown up using the little example dataset I provided. I'm thinking that the way to go may be to have 2 different scripts: one that just deletes the rows with '-' in the query field, and then after running it I run the one that my prof wrote to cluster it. Here's a link to where I downloaded the datasets from originally: http://www.gregsadetsky.com/aol-data/ It's a really interesting dataset just to open up and look at and see what people are searching for. |
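The first of those two scripts could be as small as the sketch below. It assumes the fields are tab-separated with the query in the second column; the sub name keep_row and the file names in the usage line are hypothetical. It reads one line at a time, so the 2.1GB size should not matter.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Returns 1 if the row should be kept, 0 if its Query field is just "-".
# Assumes tab-separated fields with the query in the second column.
sub keep_row {
    my ($line) = @_;
    my @fields = split /\t/, $line, -1;
    return 0 if @fields < 2;                    # malformed row
    (my $query = $fields[1]) =~ s/^\s+|\s+$//g; # trim whitespace and newline
    return $query eq '-' ? 0 : 1;
}

# Only filter when file names are given on the command line.
if (@ARGV) {
    while (my $line = <>) {
        print $line if keep_row($line);
    }
}
```

Usage would be something like: perl filter_dashes.pl input.txt > filtered.txt, followed by the grouping script run on filtered.txt.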
Basically you are not catching the case where there is a user ID that has only ONE entry, and that one entry is a '-'. |
Oops - good catch. Duplicating the if() with the regex is probably the right way to go here. -RN
I think match000's solution of testing for a '-' in the 'else' clause too is probably the best way to go there. |
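Putting the thread's conclusions together, a one-pass version for data already sorted by user ID might look like this. It is a restructured sketch rather than any of the scripts posted here: the sub name group_sorted, the tab-separated layout, and the output format (id, queries joined with "; ", count) are assumptions. The dash test is applied to every row, including the first row of each user, and a user is printed only if at least one query was kept.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Reads rows sorted by user ID from $in and writes one line per user to $out.
# Keeps only the current user's queries in memory.
sub group_sorted {
    my ($in, $out) = @_;
    my ($current, @queries);

    # Print the finished user, unless every one of its queries was a dash.
    my $flush = sub {
        print {$out} join("\t", $current, join('; ', @queries), scalar @queries), "\n"
            if defined $current && @queries;
    };

    while (my $line = <$in>) {
        chomp $line;
        my ($id, $query) = split /\t/, $line;
        next unless defined $query;          # malformed row
        next if $id eq 'AnonID';             # header row
        if (!defined $current || $id ne $current) {
            $flush->();                      # user ID changed: emit previous user
            $current = $id;
            @queries = ();
        }
        push @queries, $query unless $query =~ /^\s*-\s*$/;
    }
    $flush->();                              # emit the last user
}

# Only run against a file when one is named on the command line.
if (@ARGV) {
    open my $in, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    group_sorted($in, \*STDOUT);
    close $in;
}
```

Since the rows are streamed, a file far larger than memory should be fine as long as each user's rows really are consecutive.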
© 2002-2012 Tilted Forum Project