I'm not a CS major. This is for a Data Mining class, and it's just part of the data preprocessing we have to do before we can start analyzing the data, so it's not critical which method we use to process the data, just that we do process it.
The data is sorted by userid and then by date, so it should be fine, assuming all the rows for each userid are in consecutive blocks.
As far as loading everything into a hash goes, the dataset is 2.1GB, so that probably wouldn't work.
My partner just figured out one that seems to work on a small part of the dataset.
Code:
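#!/usr/bin/perl
use strict;
use warnings;

my $id;        # userid of the block currently being collected
my @queries;   # queries kept so far for that userid

open(my $in,  '<', 'Test')       or die "Cannot open Test: $!";
open(my $out, '>', 'TestResult') or die "Cannot open TestResult: $!";

while (my $line = <$in>) {
    chomp $line;                  # strip the trailing newline
    my @a = split /\t/, $line;    # fields of this one line: userid, query, ...
    next unless @a;               # skip empty lines
    if (!defined $id || $a[0] ne $id) {
        # new userid: write out the previous block, then start a fresh one
        print $out "$id\t", join('; ', @queries), "\t\n" if defined $id;
        # modify this print statement to display the number of terms this
        # user searched for, e.g. by also printing scalar(@queries)
        $id      = $a[0];
        @queries = ();
    }
    # keep the query unless it ends in "-" (regex FTW!)
    push @queries, $a[1] if defined $a[1] && $a[1] !~ /-$/;
}
# no "next userid" follows the last block, so write it out here
print $out "$id\t", join('; ', @queries), "\t\n" if defined $id;

close($in);
close($out) or die "Cannot close TestResult: $!";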
But for some reason I'm getting "bash: ./group_ID.pl: /usr/bin/perl: bad interpreter: Permission denied" when I try to run it. I just reinstalled Ubuntu this morning, so it must have something to do with that...
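A rough shell sketch of how that error might be narrowed down. The helper names `shebang_interpreter` and `diagnose` are made up for illustration. The sketch assumes the usual causes: either the interpreter named on the `#!` line is not executable, or the script sits on a filesystem mounted with `noexec` (common for Windows partitions mounted after a reinstall). It only checks the first cause and points at the second.

```shell
#!/bin/sh
# Hypothetical helper: print the interpreter path named on a script's #! line.
shebang_interpreter() {
    # Look only at line 1, drop "#!" and any spaces, keep the first word.
    sed -n '1s/^#![[:space:]]*\([^[:space:]]*\).*/\1/p' "$1"
}

# Hypothetical helper: report the likely reason the #! line cannot be used.
diagnose() {
    script=$1
    interp=$(shebang_interpreter "$script")
    if [ -z "$interp" ]; then
        echo "no shebang line found in $script"
    elif [ ! -x "$interp" ]; then
        echo "interpreter $interp is missing or not executable"
    else
        echo "interpreter $interp looks fine; check for a noexec mount"
    fi
}

# Usage (script name taken from the error message):
#   diagnose ./group_ID.pl
# To check the mount options by hand, look for "noexec" in the output of:
#   mount
# Workaround that does not depend on the #! line at all:
#   perl ./group_ID.pl
```

Running the script as `perl ./group_ID.pl` gets around the error either way, because the kernel then executes `perl` itself rather than the script file.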