The goal is to extract the lines that two files have in common, ignoring case and punctuation.
I have two files:
source.txt
Foo bar
blah blah black sheep
Hello World
Kick the, bucket
processed.txt
foo bar
blah sheep black
Hello world
kick the bucket ,
Desired output (from source.txt):
Foo bar
Hello World
Kick the, bucket
This is what I have been doing:
from string import punctuation

with open('source.txt') as f1, open('processed.txt') as f2:
    for i, j in zip(f1, f2):
        # Lowercase each line, drop punctuation, and collapse whitespace.
        lower_depunct_f1 = " ".join("".join(ch.lower() for ch in i if ch not in punctuation).split())
        lower_depunct_f2 = " ".join("".join(ch.lower() for ch in j if ch not in punctuation).split())
        if lower_depunct_f1 == lower_depunct_f2:
            # Print the original source line without its trailing newline.
            print(i.rstrip("\n"))
        else:
            print()
Is there a way to do this with bash tools? perl, shell, awk, sed?
Answer 0 (score: 2)
This is easier to do with awk:
awk 'FNR==NR {s=toupper($0); gsub(/[[:blank:][:punct:]]+/, "", s); a[s]++;next}
{s=toupper($0); gsub(/[[:blank:][:punct:]]+/, "", s); print (s in a)?$0:""}' file2 file1
Foo bar
Hello World
Kick the, bucket
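For comparison with the question's Python, the same lookup-table idea can be sketched as follows. This is only a rough equivalent under stated assumptions: the names `normalize` and `matching_lines` are hypothetical, the sample lines are inlined instead of being read from files, and `string.whitespace` is used where the awk version removes `[:blank:]`.

```python
import string

def normalize(line):
    """Uppercase the line and drop all whitespace and ASCII punctuation."""
    drop = set(string.punctuation) | set(string.whitespace)
    return "".join(ch for ch in line.upper() if ch not in drop)

def matching_lines(source_lines, processed_lines):
    """Return each source line if its key occurs in processed_lines, else ''."""
    seen = {normalize(line) for line in processed_lines}
    return [line if normalize(line) in seen else "" for line in source_lines]

# Sample data from the question, inlined for illustration.
source = ["Foo bar", "blah blah black sheep", "Hello World", "Kick the, bucket"]
processed = ["foo bar", "blah sheep black", "Hello world", "kick the bucket ,"]

for line in matching_lines(source, processed):
    print(line)
```

Like the awk command, this matches a source line against any line of the other file rather than only the line at the same position, and it emits an empty string for a line without a match.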
Answer 1 (score: 2)
The Perl solution is very similar to the Python one:
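open my $S1, '<', 'source.txt' or die $!;
open my $S2, '<', 'processed.txt' or die $!;
# Read both files in lockstep, one line from each per iteration.
while (defined(my $s1 = <$S1>) and defined (my $s2 = <$S2>)) {
    s/[[:punct:]]//g for $s1, $s2;   # strip punctuation from both lines
    $_ = lc for $s1, $s2;            # lowercase both lines
    # Print the normalized source line on a match, otherwise an empty line.
    print $s1 eq $s2 ? $s1 : "\n";
}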
Note that the result differs from yours, because the space after kick the bucket is not removed.
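To make that difference concrete, here is a small Python sketch. The two helper names are hypothetical and the sample lines are taken without their trailing newlines. It compares punctuation-only stripping, as in the Perl code, with the extra whitespace squeeze that the question's Python performs.

```python
import string

def strip_punct(line):
    """Lowercase and remove ASCII punctuation only."""
    return "".join(ch for ch in line.lower() if ch not in string.punctuation)

def strip_punct_and_squeeze(line):
    """Also collapse runs of whitespace and trim both ends."""
    return " ".join(strip_punct(line).split())

a, b = "Kick the, bucket", "kick the bucket ,"

# Removing the comma from b leaves the space that preceded it.
print(repr(strip_punct(a)), repr(strip_punct(b)))
print(strip_punct(a) == strip_punct(b))
print(strip_punct_and_squeeze(a) == strip_punct_and_squeeze(b))
```

With punctuation-only stripping the third pair compares unequal because of the leftover trailing space. Squeezing whitespace as well makes the two lines equal.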
Answer 2 (score: 1)
A Bash solution, quite different from the Perl one, with the same difference in the result (because the space after kick the bucket is not removed):
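#!/bin/bash
shopt -s nocasematch    # Make the [[ == ]] comparison case-insensitive.

exec 3< source.txt      # Open source.txt for reading and assign fd 3 to it.
exec 4< processed.txt   # Open processed.txt for reading on fd 4.

while read -r varline <&3 && read -r varpro <&4
do
    # Delete punctuation from both lines before comparing them.
    varline_noPunct=$(printf '%s' "$varline" | tr -d '[:punct:]')
    varpro_noPunct=$(printf '%s' "$varpro" | tr -d '[:punct:]')
    [[ $varline_noPunct == "$varpro_noPunct" ]] && echo "$varline" || echo
done

exec 3<&-   # Close fd 3.
exec 4<&-   # Close fd 4.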
Answer 3 (score: 1)
Check whether this solution helps you:
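use strict;
use warnings;

# First argument: the file whose lines are indexed.
# Second argument: the file whose matching lines are written to cmp.txt.
my $f1 = $ARGV[0];
open FILE1, "<", $f1 or die $!;
my $f2 = $ARGV[1];
open FILE2, "<", $f2 or die $!;
open OUTFILE, ">", "cmp.txt" or die $!;

my %seen;
# Index every line of the first file by its lowercased, punctuation-free form.
while (<FILE1>) {
    $_ =~ s/[[:punct:]]//g;
    $seen{lc($_)} = 1;
}
# Write each line of the second file whose normalized form was seen above.
# (The extra <FILE2> read inside this loop was removed; it skipped every other line.)
while (<FILE2>) {
    (my $key = lc $_) =~ s/[[:punct:]]//g;
    print OUTFILE $_ if $seen{$key};   # print the original, unmodified line
}
close FILE1;
close FILE2;
close OUTFILE;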