Remove duplicate entries from multiple text files in Perl?


I am new to this site and need help removing duplicate entries from multiple text files (in a loop). I tried the code below, but it does not remove duplicates for multiple files; it does work for a single file.

Code :

my $file = "$Log_dir/File_listing.txt";
my $outfile = "$Log_dir/Remove_duplicate.txt";;
open (IN, "<$file") or die "Couldn't open input file: $!";
open (OUT, ">$outfile") or die "Couldn't open output file: $!";
my %seen = ();
{
    my @ARGV = ($file);
    # local $^I = '.bac';
    while(<IN>){
        print OUT $seen{$_}++;
        next if $seen{$_} > 1;
        print OUT ;
    }
}

Thanks, arts


The errors in your script:

<ul>
<li>You overwrite (a new copy of) @ARGV with $file, so it can never have any more file arguments.</li>
<li>...which doesn't matter, because you open the file handle before you assign to @ARGV, plus you do not loop around the arguments, you just have a block { ... } around the code that serves no purpose.</li>
<li>%seen will contain dedupe data for all the files you open unless you reset it.</li>
<li>You print the count $seen{$_} to the output file, which I am sure you don't need.</li>
</ul>
<hr />

You could use the implicit open of @ARGV arguments using the diamond operator, but since you (probably) need to assign a proper output file name for each new file, that is an unwanted complication with such a solution.

use strict;
use warnings;                        # always use these

for my $file (@ARGV) {               # loop over all file names
    my $out = "$file.deduped";       # create output file name
    open my $infh, "<", $file or die "$file: $!";
    open my $outfh, ">", $out or die "$out: $!";
    my %seen;
    while (<$infh>) {
        print $outfh $_ if !$seen{$_}++;   # print if a line is never seen before
    }
}

Note that using a lexically scoped %seen variable makes the script check for duplicates inside each individual file. If you move the variable outside the for loop, you will check for duplicates across <em>all</em> files. I am not sure which you prefer.
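To make the difference concrete, here is a minimal sketch of both modes in one script. The helper name dedupe_files and its $across_files flag are made up for illustration; the ".deduped" output naming follows the code above.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original script): copy each input
# file to "<name>.deduped", dropping repeated lines. If $across_files is
# true, a line already seen in an earlier file is dropped from later files too.
sub dedupe_files {
    my ($across_files, @files) = @_;
    my %seen;                                # declared outside the file loop
    for my $file (@files) {
        %seen = () unless $across_files;     # per-file mode: forget earlier files
        my $out = "$file.deduped";
        open my $infh,  '<', $file or die "$file: $!";
        open my $outfh, '>', $out  or die "$out: $!";
        while (my $line = <$infh>) {
            # post-increment returns the old count, so only the first occurrence prints
            print {$outfh} $line unless $seen{$line}++;
        }
        close $outfh or die "$out: $!";
    }
    return;
}

# 1 = check for duplicates across all files named on the command line
dedupe_files(1, @ARGV);
```

Passing 0 instead of 1 gives the same per-file behaviour as the loop above, since %seen is emptied before each file is read.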


I think your File_listing.txt contains lines, some of which have multiple occurrences? If that's the case, just use sort from the shell (note that the output will be sorted, so the original line order is lost):

sort --unique <File_listing.txt >Remove_duplicate.txt

Or, if you prefer Perl:

perl -lne '$seen{$_}++ and next or print;' <File_listing.txt >Remove_duplicate.txt


