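Extract identical lines from two files while ignoring upper/lower case

Time: 2014-09-02 08:33:39

Tags: python bash perl unix awk

The goal is to extract the lines that are identical in two files, ignoring upper/lower case and also ignoring punctuation.

I have two files: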

source.txt

Foo bar
blah blah black sheep
Hello World
Kick the, bucket

processed.txt

foo bar
blah sheep black
Hello world
kick the bucket ,

Desired output (lines taken from source.txt):

Foo bar

Hello World
Kick the, bucket

I have been doing it like this:

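from string import punctuation
with open('source.txt', 'r') as f1, open('processed.txt', 'r') as f2:
  # Walk both files in lockstep; i and j are the lines at the same position.
  for i, j in zip(f1, f2):
    # Lowercase each line, drop punctuation, and collapse runs of whitespace.
    lower_depunct_f1 = " ".join("".join([ch.lower() for ch in i if ch not in punctuation]).split())
    lower_depunct_f2 = " ".join("".join([ch.lower() for ch in j if ch not in punctuation]).split())
    if lower_depunct_f1 == lower_depunct_f2:
      print(i.rstrip('\n'))   # print the original line from source.txt
    else:
      print('')               # keep an empty line where the lines differ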

Is there a way to do this with bash tools? perl, shell, awk, sed?

4 answers:

Answer 0: (score: 2)

This is easier to do with awk:

awk 'FNR==NR {s=toupper($0); gsub(/[[:blank:][:punct:]]+/, "", s); a[s]++;next}
   {s=toupper($0); gsub(/[[:blank:][:punct:]]+/, "", s); print (s in a)?$0:""}' file2 file1
Foo bar

Hello World
Kick the, bucket
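
For comparison with the Python code in the question, the same idea can be sketched in Python. This is only a rough equivalent under assumptions: the function names `normalize` and `matching_lines` are made up for illustration, and `string.punctuation` / `string.whitespace` only approximate awk's `[:punct:]` and `[:blank:]` classes for ASCII input.

```python
import string

def normalize(line):
    """Uppercase the line and drop all whitespace and ASCII punctuation."""
    drop = set(string.punctuation) | set(string.whitespace)
    return "".join(ch for ch in line.upper() if ch not in drop)

def matching_lines(source_lines, processed_lines):
    """Return each source line whose key occurs anywhere in processed_lines, else ''."""
    # Build the lookup set first, like the FNR==NR block of the awk command.
    seen = {normalize(line) for line in processed_lines}
    return [line if normalize(line) in seen else "" for line in source_lines]

# Sample data from the question (hypothetical in-memory stand-ins for the files).
source = ["Foo bar", "blah blah black sheep", "Hello World", "Kick the, bucket"]
processed = ["foo bar", "blah sheep black", "Hello world", "kick the bucket ,"]
for line in matching_lines(source, processed):
    print(line)
```

Note that this approach, like the awk command, tests membership: a line of the second file counts as a match if an equivalent line appears anywhere in the first file, not only at the same line number. For the sample data both interpretations give the same output.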

Answer 1: (score: 2)

The Perl solution is very similar to the Python one:

open my $S1, '<', 'source.txt'    or die $!;
open my $S2, '<', 'processed.txt' or die $!;
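while (defined(my $s1 = <$S1>) and defined(my $s2 = <$S2>)) {
    my $orig = $s1;                   # keep the untouched source line for output
    s/[[:punct:]]//g for $s1, $s2;    # strip punctuation from both lines
    $_ = lc for $s1, $s2;             # lowercase both lines
    print $s1 eq $s2 ? $orig : "\n";
}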

Note that the result differs from yours, because the space after "kick the bucket" is not removed.
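
If the leftover space should not prevent a match, the whitespace has to be collapsed after the punctuation is removed. A minimal Python 3 sketch of that normalization, assuming ASCII punctuation only (the names `normalize` and `same_position_matches` are hypothetical helpers, not part of any answer here):

```python
import string

# Translation table that deletes every ASCII punctuation character.
_DROP_PUNCT = str.maketrans("", "", string.punctuation)

def normalize(line):
    """Lowercase, strip punctuation, and collapse whitespace runs to single spaces."""
    # split() with no arguments also discards leading and trailing whitespace,
    # so "kick the bucket ," and "Kick the, bucket" end up identical.
    return " ".join(line.lower().translate(_DROP_PUNCT).split())

def same_position_matches(source_lines, processed_lines):
    """Compare lines pairwise; keep the original source line on a match, '' otherwise."""
    return [s.rstrip("\n") if normalize(s) == normalize(p) else ""
            for s, p in zip(source_lines, processed_lines)]

# Sample data from the question, with newlines as read from a file.
source = ["Foo bar\n", "blah blah black sheep\n", "Hello World\n", "Kick the, bucket\n"]
processed = ["foo bar\n", "blah sheep black\n", "Hello world\n", "kick the bucket ,\n"]
print("\n".join(same_position_matches(source, processed)))
```

Unlike a lookup table, this compares only lines at the same position in both files, which is what the code in the question does.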

Answer 2: (score: 1)

A Bash solution, along the same lines as the Perl one and with the same differing result (because the space after "kick the bucket" is not removed):

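#!/bin/bash

shopt -s nocasematch                 # Make [[ == ]] compare case-insensitively.

exec 3< source.txt                   # Open source.txt read-only on fd 3.
exec 4< processed.txt                # Open processed.txt read-only on fd 4.
while read -r varline <&3 && read -r varpro <&4
do
    # Strip punctuation; quoting prevents globbing of the line contents.
    varline_noPunct=$(echo "$varline" | tr -d '[:punct:]')
    varpro_noPunct=$(echo "$varpro" | tr -d '[:punct:]')
    # The quoted right-hand side is compared literally, not as a glob pattern.
    [[ $varline_noPunct == "$varpro_noPunct" ]] && echo "$varline" || echo
done
exec 3<&-       # Close fd 3.
exec 4<&-       # Close fd 4.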

Answer 3: (score: 1)

See whether this solution helps you:

use strict;
use warnings;

my $f1 = $ARGV[0];
open FILE1, "<", $f1 or die $!;
my $f2 = $ARGV[1];
open FILE2, "<", $f2 or die $!;


open OUTFILE, ">", "cmp.txt" or die $!;

my %seen;
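# Remember every line of the first file, lowercased and without punctuation.
while (<FILE1>) {
    $_ =~ s/[[:punct:]]//g;
    $seen{lc($_)} = 1;
}

# Write each line of the second file whose normalized form was seen.
# (Reading a second line inside the loop would skip every other line.)
while (my $line = <FILE2>) {
    (my $key = $line) =~ s/[[:punct:]]//g;
    if ($seen{lc($key)}) {
        print OUTFILE $line;    # write the original, unmodified line
    }
}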
close OUTFILE;