I am trying to merge two huge DataFrames (4+ million rows each) with the following structure:
DataFrame A:
date Fruit a b c d
01 "apple" 0 3 5 1
03 "apple" 8 2 7 2
02 "banana" 1 4 3 5
04 "banana" 3 5 2 6
03 "pineapple" 2 6 4 6
05 "pineapple" 3 5 7 9
DataFrame B:
date Fruits x y z
01 "apple, pear, strawberry" a n q
02 "banana, apple, coconut" b m p
03 "pineapple, pear, banana" c s o
04 "banana, apple, coconut" d f v
05 "pineapple, pear, banana" r ñ t
What I want to obtain is a third DataFrame with the following structure:
DataFrame C:
date Fruit a b c d x y z
01 "apple" 0 3 5 1 a n q
03 "apple" 0 3 5 1 0 0 0
02 "banana" 1 4 3 5 b m p
04 "banana" 1 4 3 5 d f v
03 "pineapple" 2 6 4 6 c s o
05 "pineapple" 2 6 4 6 r ñ t
...
I have already tried something like this:
test = market_test.assetCode.apply(lambda x : news_test.assetCodes.str.find(x)>=0)
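but my kernel crashed. I also tried using a for loop to expand the Fruits column of DataFrame B into a 'fruit-B' column, keeping B's other data columns, and then merging on the date column and the 'fruit-B' column, but the execution time was too long.
Is it possible to obtain DataFrame C from DataFrames A and B without consuming a lot of time and memory?
The Fruit and Fruits columns are of type string.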
Answer 0 (score: 0)
Use:
print (df_A)
date Fruit a b c d
0 1 apple 0 3 5 1
1 3 apple 8 2 7 2
2 2 banana 1 4 3 5
3 4 banana 3 5 2 6
4 3 pineapple 2 6 4 6
5 5 pineapple 3 5 7 9
print (df_B)
date Fruits x y z
0 1 apple, pear, strawberry a n q
1 2 banana, apple, coconut b m p
2 3 pineapple, pear, banana c s o
3 4 banana, apple, coconut d f v
4 5 pineapple, pear, banana r ñ t
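import pandas as pd
import numpy as np

# Split each comma-separated string into a list of fruit names.
df_B['Fruits'] = df_B['Fruits'].str.split(', ')
# Repeat each row of df_B once per fruit in its list, then drop the list column.
# (drop(columns=...) replaces the positional axis argument, which newer pandas rejects.)
temp = df_B.reindex(df_B.index.repeat(df_B['Fruits'].str.len())).drop(columns='Fruits')
# Flatten the lists into a single 'Fruit' column aligned with the repeated rows.
temp['Fruit'] = np.concatenate(df_B['Fruits'].values)
# Left-merge on both keys; rows of df_A without a match get 0 in x, y and z.
df_C = df_A.merge(temp, on=['date', 'Fruit'], how='left').fillna(0)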
print (df_C)
date Fruit a b c d x y z
0 1 apple 0 3 5 1 a n q
1 3 apple 8 2 7 2 0 0 0
2 2 banana 1 4 3 5 b m p
3 4 banana 3 5 2 6 d f v
4 3 pineapple 2 6 4 6 c s o
5 5 pineapple 3 5 7 9 r ñ t
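On pandas 0.25 or later, the reindex/repeat step can probably be replaced with DataFrame.explode. The following is only a sketch of that variant, not part of the original answer: it rebuilds the sample frames from the question so that it runs on its own, and it fills unmatched rows with the string "0" rather than the integer 0 (an assumption made here so that x, y and z stay string columns).

```python
import pandas as pd

# Sample data copied from the question.
df_A = pd.DataFrame({
    "date": [1, 3, 2, 4, 3, 5],
    "Fruit": ["apple", "apple", "banana", "banana", "pineapple", "pineapple"],
    "a": [0, 8, 1, 3, 2, 3],
    "b": [3, 2, 4, 5, 6, 5],
    "c": [5, 7, 3, 2, 4, 7],
    "d": [1, 2, 5, 6, 6, 9],
})
df_B = pd.DataFrame({
    "date": [1, 2, 3, 4, 5],
    "Fruits": [
        "apple, pear, strawberry",
        "banana, apple, coconut",
        "pineapple, pear, banana",
        "banana, apple, coconut",
        "pineapple, pear, banana",
    ],
    "x": ["a", "b", "c", "d", "r"],
    "y": ["n", "m", "s", "f", "ñ"],
    "z": ["q", "p", "o", "v", "t"],
})

# One row per (date, fruit): split the string, then explode the resulting lists.
temp = (
    df_B.assign(Fruit=df_B["Fruits"].str.split(", "))
        .explode("Fruit")
        .drop(columns="Fruits")
)

# Left-merge on both keys so every row of df_A is kept.
df_C = df_A.merge(temp, on=["date", "Fruit"], how="left")

# Fill only the columns that came from df_B; "0" (a string) is an assumption.
cols_from_B = ["x", "y", "z"]
df_C[cols_from_B] = df_C[cols_from_B].fillna("0")

print(df_C)
```

Because df_B is left untouched (assign returns a copy), the snippet can be rerun safely. The exploded frame has one row per fruit per date, so its size grows with the average number of fruits per row in df_B.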