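When you have a large dataset in Excel (xlsx, csv, or xls) and need to pick out certain repeated values, how do you do it? That is a rather vague and broad way to put it...
Here is an example: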
DataFrame1:
**Name** **No.** **Comment**
Bob 2123320 Doesn't Matter
Joe 2832883 Whatever
John 2139300 Irrelevant
Bob 2123320 Something
John 2234903 Regardless
DataFrame2:
**Name** **No.** **Report**
Bob 2123320 Great
Joe 2832883 Solid
John 2139300 Awesome
Bob 2123320 Good
John 2234903 Perfect
I basically want a way to select only the names that appear with two different numbers, and then list those numbers side by side:
**Name** **2139300** **2139300** **2234903** **2234903**
John Irrelevant Awesome Regardless Perfect
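So basically it would look up each name, check how many distinct numbers that name has, and for each distinct number look up the corresponding "Comment" and "Report", then output an Excel sheet like the one above. Bob appears twice, but he has the same number both times, so he does not count; John is the only relevant person.
Is there a way to do this after importing the DataFrames with Pandas, for example by using a dictionary to count the numbers for each name and then merging the DataFrames?
Many thanks.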
Answer 0 (score: 1)
How about this?
1) Group and unstack dataframe1 and dataframe2 to get the general shape you want:
dataframe1_transformed = \
dataframe1.groupby(["**Name**", '**No.**'])['**Comment**'].\
sum().unstack("**No.**")
# dataframe2 keeps its text in '**Report**', not '**Comment**'
dataframe2_transformed = \
dataframe2.groupby(["**Name**", '**No.**'])['**Report**'].\
sum().unstack("**No.**")
dataframe1_transformed
**No.** **Name** 2123320 2139300 2234903 2832883
0 Bob Doesnt MatterSomething None None None
1 Joe None None None Whatever
2 John None Irrelevant Regardless None
dataframe2_transformed
**No.** **Name** 2123320 2139300 2234903 2832883
0 Bob GreatGood None None None
1 Joe None None None Solid
2 John None Awesome Perfect None
2) Combine them:
dataframe_all_transformed = \
dataframe1_transformed.merge(dataframe2_transformed,
how='inner', left_index=True,
right_index=True)
dataframe_all_transformed
**No.** **Name** 2123320_x 2139300_x 2234903_x 2832883_x 2123320_y 2139300_y 2234903_y 2832883_y
0 Bob DoesntMatterSomething None None None GreatGood None None None
1 Joe None None None Whatever None None None Solid
2 John None Irrelevant Regardless None None Awesome Perfect None
3) Separately, count how many distinct numbers each name has:
num_appearances = dataframe1.drop_duplicates(subset=['**Name**', '**No.**']).\
groupby(['**Name**']).size()
multiple_appearing_names = num_appearances[num_appearances > 1].index
4) Filter the combined, transformed data down to just those names:
dataframe_multiple_transformed = dataframe_all_transformed.loc[
multiple_appearing_names].T.dropna().T
5) Technically, having duplicate column names in a DataFrame is a bad idea, but since you need it:
dataframe_multiple_transformed.columns = \
[x.split("_")[0] for x in dataframe_multiple_transformed.columns]
dataframe_multiple_transformed
**Name** 2139300 2234903 2139300 2234903
0 John Irrelevant Regardless Awesome Perfect
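As a side note, steps 3 and 4 can probably be written a little more directly. The sketch below is not the code from this answer: it assumes plain column names ("Name", "No.", "Comment") instead of the literal '**Name**' headers, rebuilds only the first frame from the question's sample rows, and swaps in groupby(...).nunique() and dropna(axis=1) for the drop_duplicates/size and .T.dropna().T combinations.

```python
import pandas as pd

# Sample rows copied from the question. The plain column names are an
# assumption; the answer above uses the literal '**Name**' style headers.
dataframe1 = pd.DataFrame({
    "Name": ["Bob", "Joe", "John", "Bob", "John"],
    "No.": [2123320, 2832883, 2139300, 2123320, 2234903],
    "Comment": ["Doesn't Matter", "Whatever", "Irrelevant", "Something", "Regardless"],
})

# Step 3 alternative: count the distinct numbers per name directly.
distinct_counts = dataframe1.groupby("Name")["No."].nunique()
multiple_appearing_names = distinct_counts[distinct_counts > 1].index

# Step 1 shape for a single frame: one row per name, one column per number.
# "".join concatenates the comments when a (Name, No.) pair repeats.
wide = dataframe1.groupby(["Name", "No."])["Comment"].agg("".join).unstack("No.")

# Step 4 alternative: keep those names, then drop every column that
# contains a missing value.
result = wide.loc[multiple_appearing_names].dropna(axis=1)
print(result)
```

With the sample data only John is kept, and only the two columns for his numbers survive the dropna call.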
Answer 1 (score: 1)
I would do it this way:
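import pandas as pd

# df2 keeps its text in 'Report' in the question; rename it so both frames share one column
df_out = pd.concat([df1, df2.rename(columns={'Report': 'Comment'})])
# keep only the names that have more than one distinct number
df_out = (df_out[df_out.groupby(['Name'])['No.'].transform(lambda x: x.nunique() > 1)]
    .reset_index(drop=True)
    .set_index(['Name','No.'], append=True)['Comment']
    .unstack([0,2]))
df_out.columns = df_out.columns.droplevel(0)
df_out
Output:
No. 2139300 2234903 2139300 2234903
Name
John Irrelevant Regardless Awesome Perfect
This uses reset_index to give every row a unique index, then appends 'Name' and 'No.' to that index and unstacks the new row-number level together with 'No.' to create a MultiIndex column header, and finally drops the top level of that header.
You can use rename_axis to remove the index names and get a "cleaner"-looking DataFrame:
df_out.rename_axis(None, axis=1).rename_axis(None)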
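If the duplicate column labels are a concern, one possible variation (a sketch, not part of this answer) is to merge the two frames on "Name" and "No." and unstack, so that every column is labelled by both its source column and its number. The sample frames below are rebuilt from the question, and the names mask, merged and wide are made up for illustration. It also assumes each (Name, No.) pair occurs at most once per frame among the kept names; repeated pairs would multiply rows in the merge and make the unstack fail.

```python
import pandas as pd

# Sample rows copied from the question (plain column names assumed).
df1 = pd.DataFrame({
    "Name": ["Bob", "Joe", "John", "Bob", "John"],
    "No.": [2123320, 2832883, 2139300, 2123320, 2234903],
    "Comment": ["Doesn't Matter", "Whatever", "Irrelevant", "Something", "Regardless"],
})
df2 = pd.DataFrame({
    "Name": ["Bob", "Joe", "John", "Bob", "John"],
    "No.": [2123320, 2832883, 2139300, 2123320, 2234903],
    "Report": ["Great", "Solid", "Awesome", "Good", "Perfect"],
})

# True for rows whose name has more than one distinct number.
mask = df1.groupby("Name")["No."].transform("nunique") > 1

# Pair each kept comment with the report for the same name and number.
merged = df1[mask].merge(df2, on=["Name", "No."], how="inner")

# One row per name; columns become (source column, number) pairs,
# so every label stays unique.
wide = merged.set_index(["Name", "No."]).unstack("No.")
print(wide)
```

The result has a two-level column header (Comment/Report on top, the number below), which can be flattened afterwards if a single header row is needed for Excel.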