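I am reading a file, taking the first element of each line, and comparing it against my list. If it is found, I append the line to a new output file, which should have exactly the same structure as the input file.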
my_id_list = [
    '4985439',
    '5605471',
    '6144703',
]
Input file:
4985439 16:0.0719814
5303698 6:0.09407 19:0.132581
5605471 5:0.0486076
5808678 8:0.130536
6144703 5:0.193785 19:0.0492507
6368619 3:0.242678 6:0.041733
My attempt:
import numpy as np

output_file = []
input_file = open('input_file', 'r')
for line in input_file:
    my_line = np.array(line.split())
    id = str(my_line[0])
    if id in my_id_list:
        output_file.append(line)
np.savetxt("output_file", output_file, fmt='%s')
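Currently this adds an extra blank line after every line written to the output file. How can I fix that? Or is there a more efficient way to do this?
For this example, the output file should be: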
4985439 16:0.0719814
5605471 5:0.0486076
6144703 5:0.193785 19:0.0492507
Answer 0 (score: 2)
Try something like this:
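# read lines and strip surrounding whitespace, including the trailing newline
with open('input_file', 'r') as f:
    input_lines = [line.strip() for line in f]

# collect all the lines whose first field is in your id list
# (the ids must be strings here, since line.split() yields strings;
# the `line and` check skips blank lines, which have no first field)
output_file = [line for line in input_lines if line and line.split()[0] in my_id_list]

# write to the output file, one kept line per row
with open('output_file', 'w') as f:
    f.write('\n'.join(output_file))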
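As for why the blank lines appear in the first place: as far as I can tell, each line read from the file still carries its own trailing newline, and `np.savetxt` then ends every row it writes with another one. The sketch below reproduces that effect without numpy, using an in-memory `io.StringIO` as a hypothetical stand-in for the input file. The names `sample`, `doubled` and `fixed` are made up for illustration.

```python
import io

# Hypothetical in-memory stand-in for the input file.
sample = io.StringIO("4985439 16:0.0719814\n5303698 6:0.09407 19:0.132581\n")

# Iterating over a text file yields each line with its trailing "\n" intact.
lines = list(sample)

# Appending one more newline per row (in effect what a row-terminating
# writer does) doubles every line break, which shows up as blank lines.
doubled = "".join(line + "\n" for line in lines)

# Stripping the newline before writing leaves exactly one break per row.
fixed = "".join(line.rstrip("\n") + "\n" for line in lines)

print(repr(doubled))
print(repr(fixed))
```

So either strip the newline before handing the lines to a writer that adds its own, or write the lines back unchanged with a plain `write`.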
Answer 1 (score: 1)
I don't know what numpy does to the text when reading it, but this is how you could do it without numpy:
# A set is faster for membership testing. The ids are stored as strings
# because line.split() returns strings, and '4985439' != 4985439.
my_id_list = {'4985439', '5605471', '6144703'}

with open('input_file') as input_file:
    # Your problem is most likely related to line endings, so here
    # we read the input file into a list of lines with intact line endings.
    # To preserve the input exactly, you would need to open the files
    # in binary mode ('rb' for the input file, and 'wb' for the output
    # file below).
    lines = input_file.read().splitlines(keepends=True)

with open('output_file', 'w') as output_file:
    for line in lines:
        first_word = line.split()[0]
        if first_word in my_id_list:
            output_file.write(line)
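The binary-mode variant mentioned in the comments could look roughly like the sketch below. The function name `filter_lines_binary` and its parameters are hypothetical, and it assumes the ids are plain ASCII digits. Because the files are opened with 'rb' and 'wb', every kept line is copied byte for byte, including whatever line ending it had.

```python
def filter_lines_binary(src_path, dst_path, wanted_ids):
    """Copy the lines whose first field is in wanted_ids, byte for byte.

    Returns the number of lines written.
    """
    # In binary mode the lines are bytes, so the ids must be bytes as well.
    wanted = {str(i).encode("ascii") for i in wanted_ids}
    kept = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for line in src:  # streams the file one line at a time
            # Split off only the first whitespace-separated field.
            fields = line.split(None, 1)
            # Blank lines produce an empty list and are skipped.
            if fields and fields[0] in wanted:
                dst.write(line)  # original line ending is kept as-is
                kept += 1
    return kept
```

A call such as `filter_lines_binary('input_file', 'output_file', my_id_list)` would then replace both `with` blocks. Iterating over the file object also avoids holding the whole input in memory, which may matter for large files.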
Getting the first word of each line this way is wasteful, because:
first_word = line.split()[0]
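creates a list of all the "words" in the line when we only need the first one.
If you know the columns are separated by a single space, it is more efficient to split only on the first space: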
first_word = line.split(' ', 1)[0]
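A small sketch of how the variants behave, using one line from the question's input plus a made-up tab-separated line (`tabbed`) to show where the explicit `' '` separator stops working. If the separator might be a tab or several spaces, `split(None, 1)` splits on any whitespace while still stopping after the first field.

```python
line = "6144703 5:0.193785 19:0.0492507\n"

# All of these return the same first field for a space-separated line.
first_full = line.split()[0]          # builds the full list of fields
first_space = line.split(' ', 1)[0]   # stops after the first space
first_any = line.split(None, 1)[0]    # stops after the first whitespace run
first_part = line.partition(' ')[0]   # no list at all, just a 3-tuple

# Hypothetical tab-separated line: there is no space to split on,
# so split(' ', 1) returns the whole line as its first element.
tabbed = "6144703\t5:0.193785\n"
tab_space = tabbed.split(' ', 1)[0]
tab_any = tabbed.split(None, 1)[0]

print(first_full, first_space, first_any, first_part)
print(repr(tab_space), repr(tab_any))
```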