Question

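I am reading a file, taking the first element at the start of each line, and comparing it against my list. If it is found, I append the line to a new output file, which should have exactly the same structure as the input file.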

my_id_list = [
    '4985439',
    '5605471',
    '6144703',
]

Input file:

4985439 16:0.0719814
5303698 6:0.09407 19:0.132581
5605471 5:0.0486076
5808678 8:0.130536
6144703 5:0.193785 19:0.0492507
6368619 3:0.242678 6:0.041733

My attempt:

import numpy as np

output_file = []
input_file = open('input_file', 'r')

for line in input_file:

    my_line = np.array(line.split())

    id = str(my_line[0])
    if id in my_id_list:
        output_file.append(line)
np.savetxt("output_file", output_file, fmt='%s')

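The problem:

Currently an extra blank line is added after every line written to the output file. How can I fix this? Or is there another, more efficient way to do this?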

Update

For this example, the output file should look like this:

4985439 16:0.0719814
5605471 5:0.0486076
6144703 5:0.193785 19:0.0492507

Answer 1

Try something like this:

# read lines and strip trailing newline characters
with open('input_file','r') as f:
    input_lines = [line.strip() for line in f.readlines()]

# collect all the lines that match your id list
output_file = [line for line in input_lines if line.split()[0] in my_id_list]

# write to output file
with open('output_file','w') as f:
    f.write('\n'.join(output_file))
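
As a rough illustration of why stripping helps, here is a small sketch that uses an in-memory `io.StringIO` as a stand-in for the real file (the sample data and variable names are assumptions for illustration only). Lines read from a file keep their trailing `\n`, so any writer that adds its own newline per row, as `np.savetxt` does by default, ends up producing a blank line after each one.

```python
import io

# Stand-in for the real input file (hypothetical in-memory data).
sample = io.StringIO("4985439 16:0.0719814\n5605471 5:0.0486076\n")

raw_lines = list(sample)       # iterating a file keeps the trailing '\n'
print(repr(raw_lines[0]))      # shows the newline still attached

# Adding another newline per line reproduces the blank-line symptom.
doubled = "".join(line + "\n" for line in raw_lines)

# Removing the newline first and joining once avoids it.
clean = "\n".join(line.rstrip("\n") for line in raw_lines)

print(repr(doubled))
print(repr(clean))
```

Note that joining with `'\n'` leaves no newline after the last line; write one extra `'\n'` at the end if the output has to match the input exactly.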

Answer 2

I don't know what numpy does to the text when reading it, but this is how you could do it without numpy:

# IDs are strings because they are compared with text read from the file;
# a set is faster for membership testing
my_id_list = {'4985439', '5605471', '6144703'}

with open('input_file') as input_file:
    # Your problem is most likely related to line-endings, so here
    # we read the input file into a list of lines with intact line endings.
    # To preserve the input, exactly, you would need to open the files
    # in binary mode ('rb' for the input file, and 'wb' for the output
    # file below).
    lines = input_file.read().splitlines(keepends=True)

with open('output_file', 'w') as output_file:
    for line in lines:
        first_word = line.split()[0]
        if first_word in my_id_list:
            output_file.write(line)
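
The comment above mentions binary mode; a minimal sketch of that variant might look like the following. It assumes the IDs are stored as `bytes` (lines read in binary mode are `bytes`, not `str`), and the helper name `filter_bytes`, the temporary paths, and the sample data are all made up for the demonstration.

```python
import os
import tempfile

# IDs must be bytes here because lines read in binary mode are bytes.
wanted_ids = {b"4985439", b"5605471", b"6144703"}

def filter_bytes(src_path, dst_path, wanted):
    """Copy lines whose first field is in `wanted`, byte for byte."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for line in src:                  # each line keeps its original ending
            fields = line.split(None, 1)  # split on whitespace, at most once
            if fields and fields[0] in wanted:
                dst.write(line)

# Small demonstration with made-up data and Windows-style line endings.
with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "input_file")
    dst_path = os.path.join(tmp, "output_file")
    with open(src_path, "wb") as f:
        f.write(b"4985439 16:0.0719814\r\n5303698 6:0.09407\r\n6144703 5:0.193785\r\n")
    filter_bytes(src_path, dst_path, wanted_ids)
    with open(dst_path, "rb") as f:
        result = f.read()

print(result)
```

Because nothing is decoded or re-encoded, the `\r\n` endings of the matching lines come through unchanged, and the `if fields` check skips empty lines instead of raising an `IndexError`.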

Getting the first word of each line this way is wasteful, because

first_word = line.split()[0]

creates a list of every "word" in the line when we only need the first one.

If you know that the columns are separated by spaces, you can make it more efficient by splitting only on the first space:

first_word = line.split(' ', 1)[0]
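
To make the difference concrete, here is a short sketch comparing the two calls on one of the sample lines. The helper name `first_field` is hypothetical and only wraps the expression above.

```python
def first_field(line):
    """Return the text before the first space without splitting the whole line."""
    return line.split(' ', 1)[0]

line = "6144703 5:0.193785 19:0.0492507\n"

print(line.split())        # ['6144703', '5:0.193785', '19:0.0492507']
print(line.split(' ', 1))  # ['6144703', '5:0.193785 19:0.0492507\n']
print(first_field(line))   # 6144703
```

One caveat with the single-space split: a line that contains no space at all, such as `"123\n"`, comes back whole with its newline still attached, and a line that starts with a space yields an empty string. If either can occur in the input, `line.split(None, 1)` is the safer choice.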
