Question

I have two text files.

file1.txt contains:

gedit
google chrome
git
vim
foo
bar

file2.txt contains:

firefox
svn
foo
vim

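How can I write a script that, when run with file1.txt and file2.txt as arguments, checks for text that is repeated between the two files (I mean it should work line by line) and removes the duplicated text from both files?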

So after processing, file1.txt and file2.txt should both contain the following:

gedit
google chrome
git
bar
firefox
svn

Note that foo and vim have been removed from both files.

Any guidance?

Answer 1

with open('file1.txt','r+') as f1 ,open('file2.txt','r+') as f2:
    file1=set(x.strip() for x in f1 if x.strip())
    file2=set(x.strip() for x in f2 if x.strip())
    newfile=file1.symmetric_difference(file2) #symmetric difference removes those values which are present in both sets, and returns a new set.
    f2.truncate(0) #truncate the file to 0 bytes
    f1.truncate(0)
    f2.seek(0) # move the cursor back to the start of the file
    f1.seek(0)
    for x in newfile:
        f1.write(x+'\n')
        f2.write(x+'\n')

Both files now contain (in no particular order, since sets are unordered):

svn
git
firefox
gedit
google chrome
bar
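Since the question asks for a script that takes the two file names as arguments, and the set-based version above loses the original line order, here is a minimal sketch of the same idea that reads the paths from the command line and keeps lines in first-seen order. The helper names `read_lines`, `lines_in_exactly_one` and `main` are made up for this illustration, and it assumes Python 3.7+ so that `dict.fromkeys` preserves order.

```python
import sys


def read_lines(path):
    """Return the non-blank lines of a file, stripped, in file order."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]


def lines_in_exactly_one(lines1, lines2):
    """Order-preserving symmetric difference of two lists of lines."""
    set1, set2 = set(lines1), set(lines2)
    merged = [x for x in lines1 if x not in set2]
    merged += [x for x in lines2 if x not in set1]
    # dict.fromkeys drops repeats while keeping first-seen order (Python 3.7+)
    return list(dict.fromkeys(merged))


def main(argv):
    path1, path2 = argv[1], argv[2]
    # Both files are read completely before either one is overwritten.
    merged = lines_in_exactly_one(read_lines(path1), read_lines(path2))
    for path in (path1, path2):
        with open(path, "w") as f:
            f.writelines(line + "\n" for line in merged)


if __name__ == "__main__" and len(sys.argv) == 3:
    main(sys.argv)
```

Saved under a name of your choice (say `dedupe.py`, a hypothetical name), it would be run as `python dedupe.py file1.txt file2.txt`.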

Answer 2

Do you want to save the filtered result as a third file?

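Either way, run two nested loops, one over each file, and compare every item from one loop with every item from the other; if they are equal, delete the entry from each file and move on. A rough sketch in Python, working on lists of lines:

def remove_common(a, b):
    # a and b are lists of lines, one list per file
    common = set()
    for x in a:
        for y in b:
            if x == y:
                common.add(x)  # remember lines found in both files
    a[:] = [x for x in a if x not in common]  # delete them from a in place
    b[:] = [y for y in b if y not in common]  # and from b in place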

Answer 3

If I understand your question correctly, this should be easy.

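alist = []
with open('file1.txt') as ifile1:
    for i in ifile1:
        alist.append(i.strip())  # strip the newline so lines compare equal

with open('file2.txt') as ifile2:
    for i in ifile2:
        i = i.strip()
        if i in alist:
            alist.remove(i)   # present in both files: drop it
        else:
            alist.append(i)   # only in the second file: keep it

for i in alist:
    print(i)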

Answer 4

If the files are small enough to fit in memory, this will do the job:

with open("file1.txt", "r") as f1, open("file2.txt", "r") as f2:
    # create a set from the bigger file 
    result = set(x.strip() for x in f1.readlines())
    # remove duplicates or add unique values from 2nd file
    for line in f2:
        line = line.strip()
        if line in result:
            result.remove(line)
        else:
            result.add(line)
result = "\n".join(result)

# for debug, don't replace original files
with open("file1_out.txt", "w") as f1, open("file2_out.txt", "w") as f2:
    f1.write(result)
    f2.write(result)

# if not inside a function, free memory explicitly  
del result
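One thing worth knowing about the remove-or-add loop above, as far as I can tell: it matches a set symmetric difference only when the second file has no repeated lines. If a line occurs twice in the second file, it gets removed on the first hit and added back on the second. A small sketch with made-up data (the function name `toggle_merge` is just for illustration):

```python
def toggle_merge(lines1, lines2):
    """Same remove-or-add loop as above, applied to in-memory lists."""
    result = set(lines1)
    for line in lines2:
        if line in result:
            result.remove(line)
        else:
            result.add(line)
    return result


first = ["vim", "foo", "bar"]
second = ["foo", "foo", "svn"]  # "foo" is repeated in the second list

toggled = toggle_merge(first, second)
symmetric = set(first) ^ set(second)  # ^ is the set symmetric difference

print("foo" in toggled)    # True: removed, then added back
print("foo" in symmetric)  # False
```

If repeated lines are possible in your input, building a set from the second file first and using `^` (or `symmetric_difference`) avoids the surprise.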

Answer 5

For Python 2.7+, which introduced Counter:

>>> from collections import Counter
>>> file_1 = ['gedit','google chrome','git','vim','foo','bar']
>>> file_2 = ['firefox','svn','foo','vim']
>>> de_dup = [i for i,c in Counter(file_1+file_2).items() if c == 1]
>>> de_dup
['svn', 'git', 'bar', 'gedit', 'google chrome', 'firefox']
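A caveat with counting the concatenated lists: a line that is repeated inside one file only also ends up with a count above 1 and would be dropped. If that is not what you want, one possible variation is to count each file's lines at most once. This is only a sketch; the function name `lines_unique_to_one_file` is invented here, and the ordering relies on Python 3.7+ dicts keeping insertion order.

```python
from collections import Counter


def lines_unique_to_one_file(lines1, lines2):
    """Keep lines that occur in exactly one of the two inputs."""
    # Counting sets means each file contributes at most 1 per line.
    counts = Counter(set(lines1))
    counts.update(set(lines2))
    # dict.fromkeys removes repeats but keeps first-seen order.
    ordered = list(dict.fromkeys(lines1 + lines2))
    return [line for line in ordered if counts[line] == 1]


file_1 = ['gedit', 'google chrome', 'git', 'vim', 'foo', 'bar']
file_2 = ['firefox', 'svn', 'foo', 'vim']
print(lines_unique_to_one_file(file_1, file_2))
# ['gedit', 'google chrome', 'git', 'bar', 'firefox', 'svn']
```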

Answer 6

Let's start with the input file names:

files = ('raz.txt','dwa.txt')

And a couple of helper functions. This one is a generator that reads all the words in a file:

def read(filename):
    with open(filename) as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield line

This one writes a sequence to a file:

def write(filename, lines):
    with open(filename, 'w') as f:
        f.write('\n'.join(lines))

So let's create two generators, one per input file:

words = [read(filename) for filename in files]

Then let's turn the list of generators into a list of sets:

wordSets = [set(w) for w in words]

Now we have a list of 2 sets, each holding the unique words of one file.

Let's create another set containing the words that appear in all input files, by intersecting the sets:

commonWords = set.intersection(*wordSets)
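To make the intersection step concrete, here is what it gives for the sample data from the question, written out as literal sets (a small standalone illustration, not part of the script itself; the snake_case names are mine). Because the sets are unpacked with `*`, the same call works for any number of files, as long as there is at least one.

```python
word_sets = [
    {"gedit", "google chrome", "git", "vim", "foo", "bar"},
    {"firefox", "svn", "foo", "vim"},
]

# Unpacking passes each set as a separate argument to set.intersection.
common_words = set.intersection(*word_sets)
print(sorted(common_words))  # ['foo', 'vim']
```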

Time to rewrite the files.

for filename in files:

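Since we want to save to exactly the same file, we unfortunately have to read its entire contents into memory first and write from there. (If you wanted to write to a different file, you would not need to buffer it.)

Let's create a reader generator and pull everything into memory by wrapping it in list():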

    lines = list(read(filename))

Then write the words back to the same file in their original order, but only those that are not in commonWords:

    write(filename, (word for word in lines if word not in commonWords))

Input:

raz.txt

gedit
google chrome
git
vim
foo
bar

dwa.txt

firefox
svn
foo
vim

Output:

raz.txt

gedit
google chrome
git
bar

dwa.txt

firefox
svn

The duplicates have been removed from both files.
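For convenience, the pieces above can be put together into one function, roughly like this. The wrapper name `remove_common_lines` is made up for this sketch; `read` and `write` follow the helpers described in this answer, with blank lines skipped.

```python
def read(filename):
    """Yield the non-blank, stripped lines of a file."""
    with open(filename) as f:
        for line in f:
            line = line.strip()
            if line:
                yield line


def write(filename, lines):
    """Write the given lines to a file, one per line."""
    with open(filename, "w") as f:
        f.write("\n".join(lines))


def remove_common_lines(files):
    """Remove from every file the lines that occur in all of them."""
    word_sets = [set(read(name)) for name in files]
    common_words = set.intersection(*word_sets)
    for name in files:
        # Buffer the whole file before reopening it for writing.
        lines = list(read(name))
        write(name, (w for w in lines if w not in common_words))
    return common_words
```

Calling `remove_common_lines(('raz.txt', 'dwa.txt'))` on the input shown above should leave the two files as in the output listing.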
