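I am trying to merge a series of dataframes in pandas. I have a list of dataframes, dfs, and a list of their corresponding labels, labels, and I want to merge all of the dfs into one df such that column names shared between frames get a suffix taken from the labels list. That is: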
def mymerge(dfs, labels):
    labels_dict = dict([(d, l) for d, l in zip(dfs, labels)])
    merged_df = reduce(lambda x, y:
                       pandas.merge(x, y,
                                    suffixes=[labels_dict[x], labels_dict[y]]),
                       dfs)
    return merged_df
When I try this, I get the error:
pandas.tools.merge.MergeError: Combinatorial explosion! (boom)
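I am trying to do a series of merges in which each merge grows the number of columns by at most N, where N is the number of columns in the "next" df in the list. The final df should have as many columns as all of the dfs' columns added together, so it should grow additively rather than combinatorially.
The behavior I am looking for is: join the dfs on the specified column names (e.g. those given by on=) or on the index of the dfs. Take the union of the non-common column names (as in an outer join). If a column appears in more than one df, optionally overwrite it. Having looked at the docs some more, it sounds like update might be the best way to do this, although when I try join='outer' it raises an exception signaling that it is not implemented.
Edit:
Here is my attempt at an implementation. It does not handle suffixes, but it illustrates the kind of merge I am looking for: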
def my_merge(dfs_list, on):
    """ list of dfs, columns to merge on. """
    my_df = dfs_list[0]
    for right_df in dfs_list[1:]:
        # Only put the columns from the right df
        # that are not in the existing combined df (i.e. new)
        # or which are part of the columns to join on
        new_noncommon_cols = [c for c in right_df \
                              if (c not in my_df.columns) or \
                              (c in on)]
        my_df = pandas.merge(my_df,
                             right_df[new_noncommon_cols],
                             left_index=True,
                             right_index=True,
                             how="outer",
                             on=on)
    return my_df
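This assumes that the merge happens on the index of each df. New columns are added in outer-join style, while columns that are common (and not part of the index) are used in the join through the on= keyword.
Example: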
df1 = pandas.DataFrame([{"employee": "bob",
                         "gender": "male",
                         "bob_id1": "a"},
                        {"employee": "john",
                         "gender": "male",
                         "john_id1": "x"}])
df1 = df1.set_index("employee")
df2 = pandas.DataFrame([{"employee": "mary",
                         "gender": "female",
                         "mary_id1": "c"},
                        {"employee": "bob",
                         "gender": "male",
                         "bob_id2": "b"}])
df2 = df2.set_index("employee")
df3 = pandas.DataFrame([{"employee": "mary",
                         "gender": "female",
                         "mary_id2": "d"}])
df3 = df3.set_index("employee")
merged = my_merge([df1, df2, df3], on=["gender"])
print "MERGED: "
print merged
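One could also imagine tagging the common columns of each df with a suffix based on its label, but that is less important. Can the merge operation above be done more elegantly in pandas, or does it already exist as a built-in?
Answer 0 (score: 5)
The output of your method: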
In [29]: merged
Out[29]:
         bob_id1  gender john_id1 bob_id2 mary_id1 mary_id2
employee
bob            a    male      NaN       b      NaN      NaN
john         NaN    male        x     NaN      NaN      NaN
mary         NaN  female      NaN     NaN        c        d
The pandas solution using the built-in combine_first:
In [28]: reduce(lambda x,y: x.combine_first(y), [df1, df2, df3])
Out[28]:
         bob_id1 bob_id2  gender john_id1 mary_id1 mary_id2
employee
bob            a       b    male      NaN      NaN      NaN
john         NaN     NaN    male        x      NaN      NaN
mary         NaN     NaN  female      NaN        c        d
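As a small illustration of why combine_first fits the "union the columns, keep what is already there" requirement, the sketch below uses two made-up frames (left and right are hypothetical, not from the question). It is written for Python 3. The calling frame's non-null values win, its missing values are filled from the other frame, and rows and columns that exist only in the other frame are added.

```python
import pandas as pd

# Hypothetical frames: "gender" overlaps, the id columns do not.
left = pd.DataFrame(
    {"gender": ["male", None], "id1": ["a", "x"]},
    index=["bob", "john"],
)
right = pd.DataFrame(
    {"gender": ["MALE", "male", "female"], "id2": ["b", "y", "c"]},
    index=["bob", "john", "mary"],
)

combined = left.combine_first(right)

# bob's gender stays "male": the calling frame's non-null value is kept.
# john's gender is missing on the left, so it is filled from the right.
# mary exists only on the right, so her row is added with id1 missing.
print(combined)
```

Note that overlapping columns are never duplicated here, so no suffixes are produced; conflicts are resolved in favor of the frame that comes first.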
To add a suffix to each frame's columns, I would suggest renaming the columns before calling combine_first.
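One possible way to do that renaming is sketched below for Python 3. The helper combine_with_suffixes, its keep parameter, and the sample frames df_a and df_b are all hypothetical names introduced for illustration. It assumes that only columns appearing in more than one frame should be suffixed, and that columns listed in keep (such as gender) should stay shared.

```python
from collections import Counter
from functools import reduce

import pandas as pd


def combine_with_suffixes(dfs, labels, keep=()):
    """Suffix columns shared by several frames, then fold with combine_first.

    Hypothetical helper: columns named in `keep` are left unsuffixed.
    """
    # Count in how many frames each column name occurs.
    counts = Counter(col for df in dfs for col in df.columns)

    def rename(df, label):
        mapping = {
            col: "{}_{}".format(col, label)
            for col in df.columns
            if counts[col] > 1 and col not in keep
        }
        return df.rename(columns=mapping)

    renamed = [rename(df, label) for df, label in zip(dfs, labels)]
    return reduce(lambda left, right: left.combine_first(right), renamed)


df_a = pd.DataFrame({"gender": ["male", "male"], "score": [1, 2]},
                    index=["bob", "john"])
df_b = pd.DataFrame({"gender": ["male", "female"], "score": [3, 4]},
                    index=["bob", "mary"])

merged = combine_with_suffixes([df_a, df_b], ["a", "b"], keep=["gender"])
print(merged)  # columns: gender, score_a, score_b
```

Because the renaming happens before the fold, the number of columns grows by at most the width of each new frame, which is the additive growth asked for.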
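On the other hand, you may want to look at an operation like pd.concat([df1, df2, df3], keys=['d1', 'd2', 'd3'], axis=1), which produces a dataframe with MultiIndex columns. In that case, you might consider either making gender part of the index or living with its duplication.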
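The sketch below shows both variants on two cut-down frames (the id1 and id2 column names are simplified stand-ins, not the question's exact data). It is written for Python 3. The first call leaves gender duplicated under each key; the second moves gender into the row index first so that it appears only once.

```python
import pandas as pd

df1 = pd.DataFrame(
    {"gender": ["male", "male"], "id1": ["a", "x"]},
    index=pd.Index(["bob", "john"], name="employee"),
)
df2 = pd.DataFrame(
    {"gender": ["female", "male"], "id2": ["c", "b"]},
    index=pd.Index(["mary", "bob"], name="employee"),
)

# Variant 1: gender is repeated under each top-level key.
wide = pd.concat([df1, df2], keys=["d1", "d2"], axis=1)
print(wide)

# Variant 2: make gender part of the row index before concatenating.
keyed = [df.set_index("gender", append=True) for df in (df1, df2)]
wide_keyed = pd.concat(keyed, keys=["d1", "d2"], axis=1)
print(wide_keyed)
```

Selecting wide["d1"] returns the columns that came from the first frame, so the top level of the column index plays the role of the per-frame label.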
Answer 1 (score: 1)
The error is raised by this code in the pandas source:
max_groups = 1L
for x in group_sizes:
    max_groups *= long(x)
if max_groups > 2**63:  # pragma: no cover
    raise Exception('Combinatorial explosion! (boom)')
And, in the same file:
# max groups = largest possible number of distinct groups
left_key, right_key, max_groups = self._get_group_keys()
The line max_groups *= long(x) indicates that the growth is not additive but multiplicative, which is the critical point.
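To make the difference concrete, here is a small plain-Python 3 sketch of the same check. The group_sizes values are made up for illustration: ten key columns with 100 distinct values each. Their sum is tiny, but their product is already past the 2**63 limit tested above.

```python
# Hypothetical sizes: ten join keys, each with 100 distinct values.
group_sizes = [100] * 10

# Multiplicative count, mirroring max_groups *= long(x).
total = 1
for size in group_sizes:
    total *= size

# What an additive count would have been.
additive = sum(group_sizes)

overflows = total > 2**63

print(additive)   # 1000
print(total)      # 100000000000000000000
print(overflows)  # True
```

Since merge with no on= argument joins on every column the two frames share, a chain of merges over frames with several common columns can hit this limit even when each frame is small.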