Question

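I have 2 files: phrases.txt and words_to_erase.txt

I need a way to find all the phrases in 'phrases.txt' that contain at least one word from 'words_to_erase.txt', and to create the following:

new_phrases.txt: a new file without any of the phrases found in the previous step.

erased_phrases: a file containing all the phrases that were removed to create 'new_phrases.txt'.

I can use Python or Linux.

Notes:

phrases.txt is a file with 100k phrases, one phrase per line.

words_to_erase.txt is a file with 80 distinct words, one word per line.

I tried using Linux: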

grep -f words_to_erase.txt phrases.txt > newfile.txt

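With this I only get a single file of new phrases, without the erased phrases, and I don't think the match is case-insensitive. I tried -i and it doesn't seem to work.

I tried Python with something like this: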

in_file = open("words_to_erase.txt", "rt") 
contents = in_file.read(line)         
in_file.close()     
print contents              

sourcefile = "phrases.txt"
filename2 = "newfile.txt"

def fixup( filename ): 
    print "fixup ", filename 
    fin = open( filename ) 
    fout = open( filename2 , "w") 
    for line in contents: 
        if not any(item in line for item in contents):
                fout.write(line)  
    fin.close() 
    fout.close() 

fixup(sourcefile)

Answer 1

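I used this script on a file with 400k phrases (phrases.txt) to delete every line containing a word from a file of 1,000 words (words_to_erase.txt). The script takes about 15 minutes to finish, but the accuracy is 100%.

Note: when I used grep -f words_to_erase.txt phrases.txt, grep skipped many phrases that contain words from words_to_erase.txt. This bash script searches word by word and outputs the correct number of phrases.

1.- Create the script: copy this script, paste it into a text editor, and save it under any name with the .sh extension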
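To illustrate what whole-word, case-insensitive matching means (the behaviour requested by the -i and -w options used in this answer's script), here is a minimal Python sketch. The function name `contains_word` and the sample strings are hypothetical, chosen only for this example. Python's `\b` treats letters, digits and underscores as word characters, which is close to grep -w but may differ in edge cases, for example when a word begins or ends with punctuation.

```python
import re

def contains_word(phrase, word):
    """Return True if `word` occurs in `phrase` as a whole word, ignoring case."""
    # re.escape makes the word literal; \b anchors it at word boundaries.
    pattern = r"\b" + re.escape(word) + r"\b"
    return re.search(pattern, phrase, flags=re.IGNORECASE) is not None

print(contains_word("The CAT sat down", "cat"))   # True: whole word, different case
print(contains_word("Concatenate files", "cat"))  # False: "cat" is only a substring
```

A plain substring test would report a match for both strings, so the choice between the two matching modes changes which phrases end up erased.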

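#!/bin/bash
# Start from a copy of the full phrase list, then filter it one word at a time.
cp phrases.txt newfile.txt
while IFS= read -r line
do
    [ -z "$line" ] && continue   # skip blank lines in the word list
    echo "$line"
    # Keep only lines that do NOT contain the word (-i ignore case, -w whole word, -v invert).
    grep -iwv -- "$line" newfile.txt > tmp_file.txt
    mv tmp_file.txt newfile.txt
done < words_to_erase.txt
# Sort the remaining phrases and drop duplicates.
sort -u newfile.txt > final_file.txt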

2.- Make the script executable:

    chmod +x $name_of_script.sh

3.- Run the script:
```
./$name_of_script.sh
```
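For the Python route mentioned in the question, a single pass that writes both output files could look like the sketch below. It is an alternative under stated assumptions, not the bash script from this answer: words are matched literally, as whole words and ignoring case; the files are assumed to be UTF-8; and the function name `split_phrases` is hypothetical.

```python
import re

def split_phrases(phrases_path, words_path, kept_path, erased_path):
    """Write phrases without any listed word to kept_path, the rest to erased_path.

    Returns a (kept_count, erased_count) tuple.
    """
    with open(words_path, encoding="utf-8") as f:
        # strip() removes surrounding whitespace; blank lines are skipped.
        words = [w.strip() for w in f if w.strip()]
    if not words:
        raise ValueError("word list is empty")

    # One alternation of all words, anchored at word boundaries, ignoring case.
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b",
        re.IGNORECASE,
    )

    kept = erased = 0
    with open(phrases_path, encoding="utf-8") as src, \
         open(kept_path, "w", encoding="utf-8") as keep_out, \
         open(erased_path, "w", encoding="utf-8") as erase_out:
        for line in src:
            if pattern.search(line):
                erase_out.write(line)   # phrase contains at least one listed word
                erased += 1
            else:
                keep_out.write(line)    # phrase contains none of the words
                kept += 1
    return kept, erased
```

Calling `split_phrases("phrases.txt", "words_to_erase.txt", "new_phrases.txt", "erased_phrases.txt")` would produce the two files the question asks for (the `.txt` extension on the second name is an assumption). Compiling all words into one pattern means each phrase is scanned once, instead of re-reading the phrase file for every word.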
