How can I join these two text files?
File 1:
1000001 10:0.471669 250:0.127552 30:0.218773 64:0.249413
1000002 130:0.0839656 107:0.185613 30:0.446355 110:0.38011
1000003 1:0.0835855 1117:0.0647112 302:0.0851354 46:0.0601825 48:0.098907 516:0.167713
File 2:
1000001 161:0.115664 207:0.136537 294:0.0974809 301:0.199868
1000002
1000003 555:0.0585849 91:0.0164101
Result:
1000001 10:0.471669 250:0.127552 30:0.218773 64:0.249413 161:0.115664 207:0.136537 294:0.0974809 301:0.199868
1000002 130:0.0839656 107:0.185613 30:0.446355 110:0.38011
1000003 1:0.0835855 1117:0.0647112 302:0.0851354 46:0.0601825 48:0.098907 516:0.167713 555:0.0585849 91:0.0164101
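Explanation
Document 1 and document 2 share the same structure and have the same number of lines.
Each line starts with a number (the same number in both documents), followed by several items, each made of an integer, a colon, and a decimal number:
Example: 10:0.471669
These items are unique, and I want to merge them: take the items from each line of the second document and append them to the corresponding line of the first document.
Note:
The leading number and the items are all separated from one another by a single space.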
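To make the line format described above concrete, here is a minimal sketch that splits one line into its leading key and its items. The helper name parse_line is hypothetical, and it assumes every line contains at least the key.

```python
def parse_line(line):
    """Split one line into its leading key and a list of (index, value) pairs."""
    # split() with no argument also discards the trailing newline
    key, *items = line.split()
    pairs = []
    for item in items:
        index, _, value = item.partition(':')
        pairs.append((int(index), float(value)))
    return key, pairs


key, pairs = parse_line("1000001 10:0.471669 250:0.127552\n")
print(key, pairs)
```

A line that holds only the key, such as the second line of file 2, yields an empty list of pairs.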
Here is my attempt:
dat1 = {}
with open('doc1') as f:
    for line in f.readlines():
        dat1[line.split(' ')[0]] = line.strip().split(' ')[1:]
dat2 = {}
with open('doc2') as f:
    for line in f.readlines():
        key = line.split(' ')[0]
        dat2[key] = line.split(' ')[1:]
for key in dat1.keys():
    print("%s %s %s" % (key, str.join(' ', dat1[key]), str.join(' ', dat2[key])))
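I get a traceback with a KeyError for lines in the second document that have nothing to add to the first document. In the example above, that is the case for the second line of the second file.
How can I avoid this exception and skip the lines that contain only a key and nothing else to add?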
Answer 0: (score: 4)
A simpler approach might be to use a defaultdict of lists:
from collections import defaultdict

data = defaultdict(list)
for filename in 'doc1', 'doc2':
    with open(filename) as f:
        for line in f:
            key, _, value = line.partition(' ')
            data[key.strip()].append(value.strip())
for key in sorted(data):
    print key, ' '.join(data[key])  # Python 2
    # print(key, *data[key])  # Python 3
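For printing the result, you can add:
from __future__ import print_function
to the top of your file. The Python 3 print() function then becomes available in Python 2, so you can use the Python 3 print line above.
You asked in the comments how to print to a file (Python 3, or Python 2 after importing print_function):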
with open('outfile.txt', 'w') as f:
    for key in sorted(data):
        print(key, *data[key], file=f)
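As a self-contained illustration of the defaultdict idea, the sketch below uses io.StringIO objects in place of the real doc1 and doc2 files so that it runs without any files on disk. The function name merge_lines and the shortened sample data are assumptions made for the example. It is a slight variant of the code above: it stores individual items with extend() rather than one string per file, so a key-only line contributes nothing and leaves no trailing space.

```python
from collections import defaultdict
from io import StringIO


def merge_lines(*files):
    """Collect the items of every line under its leading key."""
    data = defaultdict(list)
    for f in files:
        for line in f:
            parts = line.split()
            if not parts:          # skip blank lines
                continue
            key, items = parts[0], parts[1:]
            data[key].extend(items)
    return data


# Stand-ins for doc1 and doc2 (shortened sample data)
doc1 = StringIO("1000001 10:0.471669 250:0.127552\n1000002 130:0.0839656\n")
doc2 = StringIO("1000001 161:0.115664\n1000002\n")

merged = merge_lines(doc1, doc2)
for key in sorted(merged):
    print(key, *merged[key])
```

With real files, the StringIO objects would be replaced by open('doc1') and open('doc2').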
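Answer 1: (score: 2)
The problem is the newline character.
Every line in the file ends with a newline, which becomes part of the last entry on that line. The exception occurs because dat1 will have the key "1000002" while dat2 will have the key "1000002\n".
If you call line = line.strip() before parsing, the code should work as expected.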
for line in f.readlines():
    line = line.strip()
    key = line.split(' ')[0]
    dat2[key] = line.split(' ')[1:]
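The effect of the trailing newline can be checked in isolation. This small sketch uses the key-only line from the second file; the variable names are chosen for the example only.

```python
line = "1000002\n"

raw_key = line.split(' ')[0]            # newline is still attached
clean_key = line.strip().split(' ')[0]  # newline removed before splitting

print(repr(raw_key))                 # '1000002\n'
print(repr(clean_key))               # '1000002'
print(line.strip().split(' ')[1:])   # []
```

Because the two keys are different strings, looking up "1000002" in a dict that only holds "1000002\n" raises KeyError.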
Answer 2: (score: 1)
You can use the pop operation to take the first item of the list, like this:
def read_stem(f):
    res = {}
    for line in f.readlines():
        items = line.strip().split()
        res[items.pop(0)] = items
    return res

with open('stem.data') as f:
    dat1 = read_stem(f)
with open('stem.info') as f:
    dat2 = read_stem(f)
with open('myfile', 'w') as f:
    for key in dat1.keys():
        f.write("%s %s\n" % (key, ' '.join(dat1[key] + dat2[key])))
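The code above still assumes every key of the first file also appears in the second. If that might not hold, one possible safeguard is dict.get with an empty list as the default. The sketch below uses small hand-written dictionaries and a hypothetical helper merged_line to show the idea.

```python
dat1 = {"1000001": ["10:0.471669"], "1000002": ["130:0.0839656"]}
dat2 = {"1000001": ["161:0.115664"]}   # no entry at all for 1000002


def merged_line(key, first, second):
    # .get returns the default (an empty list) instead of raising KeyError
    return ' '.join([key] + first[key] + second.get(key, []))


for key in sorted(dat1):
    print(merged_line(key, dat1, dat2))
```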
Answer 3: (score: 1)
In your code, the key for the key-only line in the second file is '1000002\n' rather than '1000002', which is probably the cause. This works:
file1_lines = open('doc1', 'r').readlines()
file2_lines = open('doc2', 'r').readlines()
resfile = open('res.txt', 'w')
dat1 = {}
for line in file1_lines:
    dat1[line.split(' ')[0]] = line.strip().split(' ')[1:]
dat2 = {}
for line in file2_lines:
    # strip first so a key-only line yields '1000002', not '1000002\n'
    dat2[line.strip().split(' ')[0]] = line.strip().split(' ')[1:]
print(dat1)
print(dat2)
for key in dat1.keys():
    print("%s %s %s" % (key, str.join(' ', dat1[key]), str.join(' ', dat2[key])))
    resfile.write("%s %s %s\n" % (key, str.join(' ', dat1[key]), str.join(' ', dat2[key])))
resfile.close()
Answer 4: (score: 1)
You can use:
doc1_name = 'doc1'
doc2_name = 'doc2'

def get_key_and_value(key_value_list):
    if len(key_value_list) == 2:
        # list has key and values
        key, value = key_value_list
    elif len(key_value_list) == 1:
        # list only has key
        key, value = key_value_list[0], ''
    else:
        # should not happen!
        key, value = '', ''
    return key, value

def join_dict(key, value, _dict, sep=' '):
    if key in _dict:
        _dict[key] = sep.join((_dict[key], value))
    else:
        _dict[key] = value

result = {}
with open(doc1_name, 'r') as doc1, \
        open(doc2_name, 'r') as doc2:
    doc1_lines = doc1.readlines()
    doc2_lines = doc2.readlines()
    for list_of_lines in (doc1_lines, doc2_lines):
        for line in list_of_lines:
            # The .strip('\n') removes the \n at the end
            # and the .split(' ', 1) splits only once
            key_value = line.strip('\n').split(' ', 1)
            # split the lines only once to get the keys:
            key, value = get_key_and_value(key_value)
            # this can be ignored if it is known that the keys will be the same
            join_dict(key, value, result)
# order the keys (sorted() works in both Python 2 and Python 3)
ordered_keys = sorted(result)
# and write them to a file
with open('+'.join((doc1_name, doc2_name)), 'w') as output:
    for key in ordered_keys:
        output.write(' '.join((key, result[key])) + '\n')
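Since the question states that both files have the same number of lines and the same leading number on each line, a different technique from the dictionary-based answers is also possible: reading the two files in lockstep with zip. This is only a sketch under that assumption. The function name join_by_position is hypothetical, and io.StringIO objects stand in for the real files.

```python
from io import StringIO


def join_by_position(f1, f2):
    """Yield merged lines, pairing line N of f1 with line N of f2."""
    for line1, line2 in zip(f1, f2):
        key1, *items1 = line1.split()
        key2, *items2 = line2.split()
        if key1 != key2:
            # the files are assumed to list the same keys in the same order
            raise ValueError("keys differ: %r vs %r" % (key1, key2))
        yield ' '.join([key1] + items1 + items2)


# Stand-ins for doc1 and doc2 (shortened sample data)
doc1 = StringIO("1000001 10:0.471669\n1000002 130:0.0839656\n")
doc2 = StringIO("1000001 161:0.115664\n1000002\n")

joined = list(join_by_position(doc1, doc2))
for line in joined:
    print(line)
```

This keeps the original line order and never holds more than one pair of lines in memory, but it fails with ValueError if the keys ever fall out of step, and zip silently stops at the shorter file.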