Question

I have two text files, file_1 and file_2.

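file_1:

Social media is the most commonly used application on the internet in which people discuss and share with community twitter popular social media application and has witnessed wider reach social media application throughout the world researchers popularly used in political campaigns for various political campaigns prior to election process and publish paper twitter has been used as forum for understanding people's emotions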

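file_2:

Social media is the most commonly used publication on the internet in which people this case and sell with communities prepare profound social media application and help witness a social robot which in political campaigns profoundly useful four serious political campaign scrap inefficient process and publish people treated has been used as forum for understanding people's emotion.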

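I have marked in bold the words that changed between the two files. I want these word differences written to file_3, as shown below.

application - publication

discuss - this case

share - sell

I tried the code below, but it does not give the word-level difference I want: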

import difflib

with open('file_1') as f1:
    f1_text = f1.read()
with open('file_2') as f2:
    f2_text = f2.read()

# Note: the inputs are whole strings, so the diff is computed per character.
for line in difflib.unified_diff(f1_text, f2_text, fromfile='file_1', tofile='file_2', lineterm=''):
    print(line)

Either a Python or a shell script is fine. Thanks in advance.

Answer 1

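Interesting question. There may be a simpler way, but this is what I could come up with. Basically, split both files on whitespace first, then compare them word by word, or group of words by group of words.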

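# Read both files and split them into words on whitespace.
with open('file_1') as f1:
    a = f1.read().split()
with open('file_2') as f2:
    b = f2.read().split()
a_ind, b_ind = 0, 0
lookup_range = 5  # how many words to look ahead when resynchronising; configure if required
for i in range(len(a)):
    try:
        if a[a_ind] == b[b_ind]:
            # Words match: advance both positions.
            b_ind += 1
            a_ind += 1
        else:
            # Mismatch: look for the first upcoming word of a that also occurs in the upcoming words of b.
            window_a = a[a_ind:a_ind + lookup_range]
            window_b = b[b_ind:b_ind + lookup_range]
            for word in window_a:
                if word in window_b:
                    offset_a, offset_b = window_a.index(word), window_b.index(word)
                    # Everything before that shared word is the changed text.
                    print(f"{' '.join(a[a_ind:a_ind + offset_a])} - {' '.join(b[b_ind:b_ind + offset_b])}")
                    a_ind += offset_a
                    b_ind += offset_b
                    break
            # If no shared word is found within lookup_range, the positions do not advance.
    except IndexError:
        # Ran past the end of one of the word lists.
        break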

Result:

application - publication
discuss - this case
share - sell
emotions - emotion
community twitter popular - communities prepare profound
has witnessed - help witness
wider reach - robot which
popularly used - profoundly useful
world researchers - the more resources
for various - four serious
campaigns prior to election - campaign scrap inefficient
paper twitter - people treated
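
If you would rather not maintain the look-ahead logic by hand, the standard library's difflib.SequenceMatcher can align the two word lists for you. The following is only a sketch of that alternative, not the approach used above: the helper names word_diff and write_word_diff are made up for illustration, and its pairing of changed words may differ from the output listed above because the alignment algorithm is different.

```python
import difflib


def word_diff(text_a, text_b):
    """Return (old, new) string pairs for each run of words that differs.

    An inserted run has an empty 'old' string and a deleted run has an
    empty 'new' string.
    """
    a, b = text_a.split(), text_b.split()
    # autojunk=False turns off the heuristic that ignores very frequent items.
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            pairs.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return pairs


def write_word_diff(path_a, path_b, path_out):
    """Write one 'old - new' line per differing run to path_out."""
    with open(path_a) as fa, open(path_b) as fb:
        pairs = word_diff(fa.read(), fb.read())
    with open(path_out, "w") as out:
        for old, new in pairs:
            out.write(f"{old} - {new}\n")
```

Calling write_word_diff('file_1', 'file_2', 'file_3') would then produce the file_3 asked for in the question. Unlike the loop above, this has no lookup_range to tune, but long runs of changed words are reported as a single pair.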
