Question

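I have several large text files produced by different people. Each file contains a list of titles, one per line. The wording of each title differs from file to file, but the files supposedly refer to the same unknown set of items.

Because the formatting and wording differ, I am trying to generate a shorter file of probable matches that I can then check by hand. I am new to Bash, and I have tried several commands to compare every line against the titles that share two or more keywords with it. The comparison should be case-insensitive, and only keywords longer than 4 characters should count, so that articles and the like are excluded.

Example:

Input text file #1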

Investigating Amusing King : Expl and/in the Proletariat
Managing Self-Confident Legacy: The Harlem Renaissance and/in the Abject
Inventing Sarcastic Silence: The Harlem Renaissance and/in the Invader
Inventing Random Ethos: The Harlem Renaissance and/in the Marginalized
Loss: Supplementing Transgressive Production and Assimilation

Input text file #2

Loss: Judging Foolhardy Historicism and Homosexuality
Loss: Developping Homophobic Textuality and Outrage
Loss: Supplement of transgressive production
Loss: Questioning Diligent Verbiage and Mythos
Me Against You: Transgressing Easygoing Materialism and Dialectic

Output text file

File #1-->Loss: Supplementing Transgressive Production and Assimilation
File #2-->Loss: Supplement of transgressive production

So far I have been able to weed out some duplicates whose entries are exactly identical...

cat FILE_num*.txt | sort | uniq -d > berbatim_duplicates.txt

...and some others whose bracketed annotations are identical:

  cat FILE_num*.txt | sort | cut -d "{" -f2 | cut -d "}" -f1 | uniq -d > same_annotations.txt

The command that looked most promising was find with a regular expression, but I could not get it to work.

Thanks in advance.

Answer 1

In Python 3:

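from sys import argv
from re import sub

def getWordSet(line):
    # Drop [bracketed] and (parenthesised) text and basic punctuation,
    # then split the line into words.
    line=sub(r'\[.*\]|\(.*\)|[.,!?:]','',line).split()
    s=set()
    for word in line:
        # Keep only words longer than 4 characters, lowercased so that
        # the comparison is case-insensitive.
        if len(word)>4:
            word=word.lower()
            s.add(word)
    return s

def compare(file1, file2):
    file1 = file1.split('\n')
    file2 = file2.split('\n')
    # Compare every line of file1 with every line of file2.
    for line1,set1 in zip(file1,map(getWordSet,file1)):
        for line2,set2 in zip(file2,map(getWordSet,file2)):
            # Report the pair if it shares at least two keywords.
            if len(set1.intersection(set2))>1:
                print("File #1-->",line1,sep='')
                print("File #2-->",line2,sep='')

if __name__=='__main__':
    # The two files to compare are given as command-line arguments.
    with open(argv[1]) as file1, open(argv[2]) as file2:
        compare(file1.read(),file2.read())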

This gives the expected output: it prints the matching pairs of lines from the two files.
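To see why the two sample titles are paired, here is a minimal sketch that applies the same keyword rule to them in isolation. The helper name word_set and its min_len parameter are made up for this illustration; the rule itself (strip bracketed text and punctuation, lowercase, keep words longer than 4 characters) is the one used by getWordSet.

```python
import re

def word_set(line, min_len=5):
    # Same rule as getWordSet: drop bracketed/parenthesised text and basic
    # punctuation, then keep lowercased words of at least min_len characters
    # (min_len=5 is equivalent to "longer than 4").
    cleaned = re.sub(r'\[.*\]|\(.*\)|[.,!?:]', '', line)
    return {w.lower() for w in cleaned.split() if len(w) >= min_len}

title1 = "Loss: Supplementing Transgressive Production and Assimilation"
title2 = "Loss: Supplement of transgressive production"

# Keywords the two titles have in common.
shared = word_set(title1) & word_set(title2)
print(sorted(shared))  # ['production', 'transgressive']
```

"Loss" is only 4 characters long, so it does not count, and "Supplementing" and "Supplement" are different words under this rule. The pair still matches because two keywords are shared.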

Save this script in a file. I will call it script.py, but you can name it whatever you like. You can launch it with:

python3 script.py file1 file2

You can even use an alias:

alias comp="python3 script.py"

and then

comp file1 file2

I have included the features mentioned in the discussion below.
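The question mentions several large files, while the script above compares exactly two and checks every line against every other line. As a rough sketch of one possible alternative (not part of the original answer), the same keyword rule can be combined with an inverted index so that only lines sharing at least one keyword are ever compared. The names word_set and find_matches, and the dictionary layout of files, are assumptions made for this illustration.

```python
from collections import defaultdict
from itertools import combinations
import re

def word_set(line, min_len=5):
    # Same keyword rule as getWordSet in the answer above.
    cleaned = re.sub(r'\[.*\]|\(.*\)|[.,!?:]', '', line)
    return {w.lower() for w in cleaned.split() if len(w) >= min_len}

def find_matches(files, min_shared=2):
    """files maps a label to a list of title lines (hypothetical layout)."""
    # Inverted index: keyword -> set of (label, line_index) entries using it.
    index = defaultdict(set)
    for label, lines in files.items():
        for n, line in enumerate(lines):
            for word in word_set(line):
                index[word].add((label, n))

    # Count shared keywords for each pair of entries from different files.
    shared = defaultdict(int)
    for entries in index.values():
        for a, b in combinations(sorted(entries), 2):
            if a[0] != b[0]:
                shared[(a, b)] += 1

    return [pair for pair, count in sorted(shared.items()) if count >= min_shared]

files = {
    "File #1": [
        "Inventing Random Ethos: The Harlem Renaissance and/in the Marginalized",
        "Loss: Supplementing Transgressive Production and Assimilation",
    ],
    "File #2": [
        "Loss: Developping Homophobic Textuality and Outrage",
        "Loss: Supplement of transgressive production",
    ],
}

for (label1, n1), (label2, n2) in find_matches(files):
    print(label1, "-->", files[label1][n1], sep="")
    print(label2, "-->", files[label2][n2], sep="")
```

With the sample titles this prints the same pair as the script above. Very common keywords can still produce many candidate pairs, so the result is meant as a shortlist for manual checking, as the question intends.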

Find all lines in text files that have at least two words in common (Bash)
