Question

I want to compute the word delta between two files.

file_1.txt contains: One file with some text and words.
file_2.txt contains: One file with some text and additional words to be found.

On Unix systems, the diff command gives the following information. difflib can produce similar output.

$ diff file_1.txt file_2.txt 
1c1
< One file with some text and words.
---
> One file with some text and additional words to be found.

Is there a simple way to find the words added or removed between two files, or at least between two lines, the way git diff --word-diff does?

Answer 1

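First, read each file into a string with open(), where 'file_1.txt' is the path to the file and 'r' stands for "read mode". Do the same for the second file. Don't forget to close() your files when you are done! Then use split(' ') to split the string you have just read into a list of words.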

file_1 = open('file_1.txt', 'r')
text_1 = file_1.read().split(' ')
file_1.close()
file_2 = open('file_2.txt', 'r')
text_2 = file_2.read().split(' ')
file_2.close()
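
As a sketch of a slightly more defensive reading step: a with block closes the file even if reading raises an error, and split() with no argument splits on any run of whitespace, so the trailing newline does not stay attached to the last word. The helper name read_words is hypothetical and not part of the original answer.

```python
def read_words(path):
    """Return the whitespace-separated words of the file at `path`."""
    # `read_words` is a hypothetical helper name chosen for illustration.
    # The `with` block closes the file automatically, even on error.
    with open(path, 'r') as handle:
        # split() without arguments splits on spaces, tabs and newlines
        # and never yields empty strings.
        return handle.read().split()
```

It could then be called as text_1 = read_words('file_1.txt') and text_2 = read_words('file_2.txt').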

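Next, you need to diff the text_1 and text_2 lists. There are many ways to do this.

1)

You can use the Counter class from the collections module. Pass each list to the class constructor, then find the differences by subtracting in both directions, call elements() to get the remaining elements, and use list() to convert them to a list.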

from collections import Counter
text_count_1 = Counter(text_1)
text_count_2 = Counter(text_2)
difference = list((text_count_1 - text_count_2).elements()) + list((text_count_2 - text_count_1).elements())

Here is how to compute the word delta.

from collections import Counter
text_count_1 = Counter(text_1)
text_count_2 = Counter(text_2)

delta = len(list((text_count_2 - text_count_1).elements())) \
      - len(list((text_count_1 - text_count_2).elements()))

print(delta)
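
Since the question asks which words were added or removed, the two subtractions can also be kept separate instead of only being counted. The following is a minimal sketch of that idea; the function name added_and_removed is an assumption for illustration, and the sample lists are the two sentences from the question, inlined so the snippet runs without the files.

```python
from collections import Counter


def added_and_removed(old_words, new_words):
    """Return (added, removed) word lists, respecting how often each word occurs."""
    old_count = Counter(old_words)
    new_count = Counter(new_words)
    # Counter subtraction drops zero and negative counts, so each direction
    # keeps only the surplus words on that side.
    added = list((new_count - old_count).elements())
    removed = list((old_count - new_count).elements())
    return added, removed


# Sample data: the contents of file_1.txt and file_2.txt from the question.
text_1 = "One file with some text and words.".split()
text_2 = "One file with some text and additional words to be found.".split()

added, removed = added_and_removed(text_1, text_2)
print(added)                      # ['additional', 'words', 'to', 'be', 'found.']
print(removed)                    # ['words.']
print(len(added) - len(removed))  # 4
```

Note that "words." and "words" count as different words because the punctuation is part of the token. The net delta is also always equal to len(text_2) - len(text_1), so the Counter approach is mainly useful when you want the words themselves.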

2)

Use the Differ class from the difflib module. Pass both lists to the compare() method of a Differ instance, then iterate over the result with a for loop.

from difflib import Differ
difference = []
for d in Differ().compare(text_1, text_2):
    difference.append(d)

Then you can compute the word delta like this.

from difflib import Differ

delta = 0

for d in Differ().compare(text_1, text_2):
    status = d[0]

    if status == "+":
        delta += 1

    elif status == "-":
        delta -= 1

print(delta)
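
For output closer to git diff --word-diff, one option is difflib.SequenceMatcher, which is a different difflib class from the Differ used above. The sketch below is only an imitation of git's [-removed-] and {+added+} markers, not git's actual algorithm; the function name word_diff is hypothetical, and the sample sentences are taken from the question.

```python
from difflib import SequenceMatcher


def word_diff(old_words, new_words):
    """Render a word-level diff of two word lists in a git-like inline format."""
    matcher = SequenceMatcher(None, old_words, new_words, autojunk=False)
    pieces = []
    # Each opcode describes how old_words[i1:i2] turns into new_words[j1:j2].
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            pieces.append(" ".join(old_words[i1:i2]))
        if tag in ("delete", "replace"):
            pieces.append("[-" + " ".join(old_words[i1:i2]) + "-]")
        if tag in ("insert", "replace"):
            pieces.append("{+" + " ".join(new_words[j1:j2]) + "+}")
    return " ".join(pieces)


# Sample data: the contents of file_1.txt and file_2.txt from the question.
old = "One file with some text and words.".split()
new = "One file with some text and additional words to be found.".split()
print(word_diff(old, new))
# One file with some text and [-words.-] {+additional words to be found.+}
```

Unlike the Counter approach, this keeps the words in their original order and shows where in the text each change happened.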

3)

You can write the diff function yourself. For example:

def get_diff (list_1, list_2):
    d = []
    for item in list_1:
        if item not in list_2:
            d.append(item)
    return d

difference = get_diff(text_1, text_2) + get_diff(text_2, text_1)

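I am sure there are other ways to do this, but I will limit myself to these three. Once you have the difference list, you can format the output however you like.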

Answer 2

Here is another way to do this, using dict():

#!/usr/bin/env python3

import sys


def loadfile(filename):
    # Use the dict keys as a set of the distinct words in the file.
    h = dict()
    with open(filename) as f:
        for line in f:
            for word in line.split(' '):
                h[word.strip()] = 1
    return h


first = loadfile(sys.argv[1])
second = loadfile(sys.argv[2])

print("in both first and second")
for k in first:
    # Skip the empty string produced by blank lines or repeated spaces.
    if k and k in second:
        print(k)
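
The script above lists the words the two files share. To get the words that differ instead, the same idea can be expressed with Python's built-in set type, as in this sketch. The function name compare_word_sets and the dictionary keys are assumptions for illustration, and the sample sentences from the question are inlined so the snippet runs without command-line arguments.

```python
def compare_word_sets(first_words, second_words):
    """Split the distinct words of two word lists into three sorted groups."""
    first = set(first_words)
    second = set(second_words)
    return {
        "only_in_first": sorted(first - second),   # removed words
        "only_in_second": sorted(second - first),  # added words
        "in_both": sorted(first & second),         # unchanged words
    }


# Sample data: the contents of file_1.txt and file_2.txt from the question.
first_words = "One file with some text and words.".split()
second_words = "One file with some text and additional words to be found.".split()

result = compare_word_sets(first_words, second_words)
print(result["only_in_first"])   # ['words.']
print(result["only_in_second"])  # ['additional', 'be', 'found.', 'to', 'words']
```

Because sets keep only distinct words, this variant ignores how many times a word occurs; use the Counter approach from Answer 1 if repeated words matter.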

