Question

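I am trying to remove duplicates from a 3-column tab-delimited txt file, but as long as the first two columns are duplicated, the row should be removed even if its third column differs.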

from operator import itemgetter
import sys

input = sys.argv[1]
output = sys.argv[2]

#Pass any column number you want, note that indexing starts at 0
ig = itemgetter(0,1) 
seen = set()
data = []
for line in input.splitlines():
    key = ig(line.split())
    if key not in seen:
        data.append(line)
        seen.add(key)
        file = open(output, "w")
        file.write(data)
        file.close()

First, I get this error:

key = ig(line.split())
IndexError: list index out of range

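Also, I can't see how to save the result to output.txt.

People say that saving to output.txt is a very basic matter, but no tutorial has helped.

I have tried approaches that use codecs and approaches that use file.write(data), and none of them helped.

I could learn MatLab very easily. The online tutorials are excellent, and a few Google searches always help a lot.

But I have not yet been able to find a useful Python tutorial. That is obviously because I am a complete novice. For complete novices like me, what is the best tutorial in terms of 1) comprehensiveness, 2) plenty of examples, and 3) line-by-line explanation that leaves no line unexplained?

Why does the code above raise an error and fail to save the result?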

Answer 1

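I assume that by assigning the first command-line argument to input (input = sys.argv[1]) and the second to output, you intend them to be your input and output file names. But you never open any file for the input data, so you are calling .splitlines() on the file name, not on the file's contents.
Next, splitlines() is the wrong approach here anyway. To iterate over a file line-by-line, simply use for line in f, where f is an open file. Those lines will include the newline at the end of the line, so it needs to be stripped if it is not supposed to be part of the third column's data.
Then you open and close the output file inside the loop, which means you try to write the entire contents of data to the file on every iteration, effectively overwriting whatever was written before. Therefore I moved that block out of the loop.
It's good practice to use the with statement for opening files. with open(out_fn, "w") as outfile opens the file named out_fn, assigns the open file to outfile, and closes it as soon as that indented block is exited.
input is a built-in function in Python. I therefore renamed your variables so that no built-in name is shadowed.
You are trying to write data directly to the output file. That does not work because data is a list of lines. You need to join those lines first to turn them back into a single string before writing it to the file.

So here is your code with all of those issues fixed: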

from operator import itemgetter
import sys


in_fn = sys.argv[1]
out_fn = sys.argv[2]

getkey = itemgetter(0, 1)
seen = set()
data = []

with open(in_fn, 'r') as infile:
    for line in infile:
        line = line.strip()
        key = getkey(line.split())
        if key not in seen:
            data.append(line)
            seen.add(key)

with open(out_fn, "w") as outfile:
    outfile.write('\n'.join(data))
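
One caveat with the code above: line.split() with no argument splits on any whitespace, so a column value that itself contains a space would be broken apart, and a blank line in the input would still raise the IndexError. The following is a minimal sketch, not part of the original answer, that splits on tabs only and skips lines with fewer than two columns. The function name unique_by_first_two and the sample data are made up for illustration.

```python
def unique_by_first_two(lines, sep="\t"):
    """Yield each line whose first two columns have not been seen before.

    Hypothetical helper for illustration; `lines` can be an open file
    or any other iterable of strings.
    """
    seen = set()
    for line in lines:
        line = line.rstrip("\r\n")      # drop only the line ending
        columns = line.split(sep)
        if len(columns) < 2:            # blank or malformed line: skip it
            continue
        key = (columns[0], columns[1])
        if key not in seen:
            seen.add(key)
            yield line


# Made-up sample: two rows share the first two columns, one line is blank
sample = ["a\tb\t1\n", "a\tb\t2\n", "\n", "a\tc\t3\n"]
result = list(unique_by_first_two(sample))
print(result)
```

Because the function accepts any iterable of strings, it could be fed the open infile directly in place of the loop body above.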

Answer 2

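Why does the code above raise an error?
Because you never opened the file, you are working with the string input.txt rather than with the file. Then, when you try to access the items, the list index is out of range, because line.split() returns ['input.txt']. How to fix it: open the file and work with it, not with its name. For example, you could do this (I tried to stay as close to your code as possible):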

input = sys.argv[1]
infile = open(input, 'r')
(...)
lines = infile.readlines()
infile.close()
for line in lines:
    (...)
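
To see the failure described above in isolation, here is a small reproduction. It is only a sketch: the file name input.txt is assumed as an example of what sys.argv[1] might contain, and no file is actually opened.

```python
from operator import itemgetter

ig = itemgetter(0, 1)
filename = "input.txt"   # assumed example value of sys.argv[1], a plain string

# splitlines() on the name yields a single "line": the name itself
lines = filename.splitlines()

try:
    ig(lines[0].split())     # split() gives ['input.txt'], so index 1 does not exist
    error = None
except IndexError as exc:
    error = exc

print(lines, repr(error))
```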

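Why does it not save the result?
Because you open and close the file inside the loop. What you need to do is write the data after you have left the loop. Also, you cannot write a list directly to a file. So you need to do something like this (outside your loop):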

outfile = open(output, "w")
for item in data:
  outfile.write(item)
outfile.close()
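
The claim that a list cannot be written directly can be checked with a short experiment. This is only a sketch: it writes to a temporary file so that no real data is touched, and the sample lines are made up. It also shows writelines(), a standard file method that writes each string in turn without adding newlines, as an alternative to the explicit loop.

```python
import os
import tempfile

# Made-up lines, with the trailing newlines kept as they come from readlines()
data = ["a\tb\t1\n", "a\tc\t3\n"]

path = os.path.join(tempfile.mkdtemp(), "output.txt")

with open(path, "w") as outfile:
    try:
        outfile.write(data)       # a list is not a string, so this raises
        failed = False
    except TypeError:
        failed = True
    outfile.writelines(data)      # writes each string in turn, adds no newlines

with open(path) as check:
    written = check.read()

print(failed, repr(written))
```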

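Putting it together
There are other ways to read and write files, and it is pretty well documented on the internet, but I tried to stay close to your code so that you can better understand what was wrong with it.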

from operator import itemgetter
import sys

input = sys.argv[1]
infile = open(input, 'r')
output = sys.argv[2]

#Pass any column number you want, note that indexing starts at 0
ig = itemgetter(0,1)
seen = set()
data = []
lines = infile.readlines()
infile.close()
for line in lines:
    print(line)
    key = ig(line.split())
    if key not in seen:
        data.append(line)
        seen.add(key)

print(data)
outfile = open(output, "w")
for item in data:
  outfile.write(item)
outfile.close()

PS: it seems to produce the result you need (see "Python to remove duplicates using only some, not all, columns").
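
As one example of the other ways to read and write files mentioned above, the standard csv module can handle the tab splitting and joining. This is a sketch for Python 3 rather than the approach taken in this answer; the function name dedupe_tsv and the temporary file names are made up for illustration. Note that csv applies its default quoting rules, so fields containing double quotes may be treated differently from a plain split on tabs.

```python
import csv
import os
import tempfile


def dedupe_tsv(in_path, out_path):
    """Copy in_path to out_path, keeping the first row for each (col0, col1) pair."""
    seen = set()
    with open(in_path, newline="") as infile, \
            open(out_path, "w", newline="") as outfile:
        reader = csv.reader(infile, delimiter="\t")
        writer = csv.writer(outfile, delimiter="\t", lineterminator="\n")
        for row in reader:
            if len(row) < 2:          # skip blank or short rows
                continue
            key = (row[0], row[1])
            if key not in seen:
                seen.add(key)
                writer.writerow(row)  # rows are written as they are read


# Small self-check with made-up data in a temporary directory
tmp = tempfile.mkdtemp()
src = os.path.join(tmp, "input.txt")
dst = os.path.join(tmp, "output.txt")
with open(src, "w", newline="") as f:
    f.write("a\tb\t1\na\tb\t2\n\na\tc\t3\n")

dedupe_tsv(src, dst)

with open(dst, newline="") as f:
    output_text = f.read()
print(repr(output_text))
```

Writing each row as soon as it is accepted means the data list is no longer needed; only the set of keys is kept in memory.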

Python: remove duplicates and save the result
