I want to merge multiple files with a single file (f1.txt) based on a 2-column match. I can do this in pandas, but it reads everything into memory, which blows up very quickly. Reading line by line, I believe, avoids loading everything into memory, and pandas is not an option right now. How do I perform the merge while filling the cells that have no match in f1.txt with null values?
Here I used a dictionary, but I am not sure whether it will be held entirely in memory, and I could not find a way to add nulls for keys in f1.txt that have no match in the other files. There can be up to 1000 different files to merge. Time is not important, as long as I don't read everything into memory.
Files (tab-separated)
f1.txt
A B num val scol
1 a1 1000 2 3
2 a2 456 7 2
3 a3 23 2 7
4 a4 800 7 3
5 a5 10 8 7
a1.txt
A B num val scol fcol dcol
1 a1 1000 2 3 0.2 0.77
2 a2 456 7 2 0.3 0.4
3 a3 23 2 7 0.5 0.6
4 a4 800 7 3 0.003 0.088
a2.txt
A B num val scol fcol2 dcol1
2 a2 456 7 2 0.7 0.8
4 a4 800 7 3 0.9 0.01
5 a5 10 8 7 0.03 0.07
Current code
    import os
    import csv

    m1 = os.getcwd() + '/f1.txt'
    files_to_compare = [i for i in os.listdir('dir')]
    dictionary = dict()
    dictionary1 = dict()
    with open(m1, 'rt') as a:
        reader1 = csv.reader(a, delimiter='\t')
        for x in files_to_compare:
            with open(os.getcwd() + '/dir/' + x, 'rt') as b:
                reader2 = csv.reader(b, delimiter='\t')
                for row1 in list(reader1):
                    dictionary[row1[0]] = list()
                    dictionary1[row1[0]] = list(row1)
                for row2 in list(reader2):
                    try:
                        dictionary[row2[0]].append(row2[5:])
                    except KeyError:
                        pass
    print(dictionary)
    print(dictionary1)
What I am trying to achieve is similar to using: df.merge(df1, on=['A', 'B'], how='left').fillna('null')
Current result
{'A': [['fcol1', 'dcol1'], ['fcol', 'dcol']], '1': [['0.2', '0.77']], '2': [['0.7', '0.8'], ['0.3', '0.4']], '3': [['0.5', '0.6']], '4': [['0.9', '0.01'], ['0.003', '0.088']], '5': [['0.03', '0.07']]}
{'A': ['A', 'B', 'num', 'val', 'scol'], '1': ['1', 'a1', '1000', '2', '3'], '2': ['2', 'a2', '456', '7', '2'], '3': ['3', 'a3', '23', '2', '7'], '4': ['4', 'a4', '800', '7', '3'], '5': ['5', 'a5', '10', '8', '7']}
Desired result
{'A': [['fcol1', 'dcol1'], ['fcol', 'dcol']], '1': [['0.2', '0.77'],['null', 'null']], '2': [['0.7', '0.8'], ['0.3', '0.4']], '3': [['0.5', '0.6'],['null', 'null']], '4': [['0.9', '0.01'], ['0.003', '0.088']], '5': [['null', 'null'],['0.03', '0.07']]}
{'A': ['A', 'B', 'num', 'val', 'scol'], '1': ['1', 'a1', '1000', '2', '3'], '2': ['2', 'a2', '456', '7', '2'], '3': ['3', 'a3', '23', '2', '7'], '4': ['4', 'a4', '800', '7', '3'], '5': ['5', 'a5', '10', '8', '7']}
My final aim is to write the dictionary to a text file. I don't know how much memory it will use or whether it will fit in memory. If there is a better way without pandas, that would be great; otherwise, how do I make the dictionary work?
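For reference, here is one way the dictionary approach could fill in nulls (a sketch only: the helper name `left_join` is mine, and it assumes each extra file contributes exactly the columns after index 5, as in a1.txt/a2.txt; only the keys of f1.txt are held in memory, the other files are streamed row by row):

```python
import csv

def left_join(first_path, other_paths, sep='\t', null=('null', 'null')):
    """Mimic df.merge(..., how='left').fillna('null') as a dict.

    Only the first file's rows are held in memory; every other
    file is streamed row by row.
    """
    with open(first_path, newline='') as f:
        rows = {row[0]: row for row in csv.reader(f, delimiter=sep)}
    matches = {key: [] for key in rows}
    for path in other_paths:
        seen = set()
        with open(path, newline='') as f:
            for row in csv.reader(f, delimiter=sep):
                if row[0] in matches:
                    matches[row[0]].append(row[5:])
                    seen.add(row[0])
        # pad the keys this file did not match, like a left join
        for key in matches.keys() - seen:
            matches[key].append(list(null))
    return rows, matches
```

Because each file appends either its matched columns or one null pad per key, the per-key lists stay aligned with the file order.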
Dask attempt
    import dask.dataframe as dd

    directory = 'input_dir/'
    first_file = dd.read_csv('f1.txt', sep='\t')
    df = dd.read_csv(directory + '*.txt', sep='\t')
    df2 = dd.merge(first_file, df, on=['A', 'B'])
I kept getting ValueError: Metadata mismatch found in 'from_delayed'
    +--------+-------+----------+
    | column | Found | Expected |
    +--------+-------+----------+
    | fcol   | int64 | float64  |
    +--------+-------+----------+
I googled and found similar complaints but could not resolve it, which is why I decided to try this approach. I checked my files and all the dtypes seem consistent. My dask version is 2.9.1.
Answer 0 (score: 1)
If you want a hand-crafted solution, you can look at heapq.merge and itertools.groupby. This assumes your files are sorted by the first two columns (the key).
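The two building blocks can be seen in isolation first (a toy sketch; the tuples and column layout are made up for illustration):

```python
from heapq import merge
from itertools import groupby

# merge() lazily interleaves already-sorted iterables without loading
# them fully; groupby() then batches consecutive items sharing a key.
left = [('1', 'a1', 'x'), ('3', 'a3', 'x')]
right = [('1', 'a1', 'y'), ('2', 'a2', 'y')]
key = lambda row: (row[0], row[1])

for k, group in groupby(merge(left, right, key=key), key=key):
    print(k, [row[2] for row in group])
# ('1', 'a1') ['x', 'y']
# ('2', 'a2') ['y']
# ('3', 'a3') ['x']
```

Because both inputs are streamed and only one group is materialized at a time, memory use stays proportional to the largest group, not the file sizes.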
I made a simple example that merges and groups the files and produces two output files instead of a dictionary (so (almost) nothing is stored in memory; everything is read from / written to disk):
    from heapq import merge
    from itertools import groupby

    first_file_name = 'f1.txt'
    other_files = ['a1.txt', 'a2.txt']

    def get_lines(filename):
        with open(filename, 'r') as f_in:
            for line in f_in:
                yield [filename, *line.strip().split()]

    def get_values(lines):
        for line in lines:
            yield line
        while True:
            yield ['null']

    opened_files = [get_lines(f) for f in [first_file_name] + other_files]

    # save headers
    headers = [next(f) for f in opened_files]

    with open('out1.txt', 'w') as out1, open('out2.txt', 'w') as out2:
        # print headers to files
        print(*headers[0][1:6], sep='\t', file=out1)
        new_header = []
        for h in headers[1:]:
            new_header.extend(h[6:])
        print(*(['ID'] + new_header), sep='\t', file=out2)

        for v, g in groupby(merge(*opened_files, key=lambda k: (k[1], k[2])), lambda k: (k[1], k[2])):
            lines = [*g]
            print(*lines[0][1:6], sep='\t', file=out1)

            out_line = [lines[0][1]]
            iter_lines = get_values(lines[1:])
            current_line = next(iter_lines)
            for current_file in other_files:
                if current_line[0] == current_file:
                    out_line.extend(current_line[6:])
                    current_line = next(iter_lines)
                else:
                    out_line.extend(['null', 'null'])
            print(*out_line, sep='\t', file=out2)
This produces two files:
out1.txt:
A B num val scol
1 a1 1000 2 3
2 a2 456 7 2
3 a3 23 2 7
4 a4 800 7 3
5 a5 10 8 7
out2.txt:
ID fcol dcol fcol2 dcol1
1 0.2 0.77 null null
2 0.3 0.4 0.7 0.8
3 0.5 0.6 null null
4 0.003 0.088 0.9 0.01
5 null null 0.03 0.07
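If you still need the dictionary form from the question, out2.txt can be folded back into it after the fact (a sketch; the helper name `to_dict` is mine, and it assumes each source file contributed exactly two value columns):

```python
import csv

def to_dict(path, sep='\t'):
    """Read out2.txt back into {key: [[v1, v2], [v3, v4], ...]},
    pairing the value columns two at a time (one pair per file)."""
    result = {}
    with open(path, newline='') as f:
        for row in csv.reader(f, delimiter=sep):
            result[row[0]] = [row[i:i + 2] for i in range(1, len(row), 2)]
    return result
```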