pandas: matching columns across datasets

Time: 2017-03-09 17:32:35

Tags: python pandas

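I have several datasets and want to find out how they relate to one another. For example, if a string column in dataset A and a string column in dataset B share many values, that may indicate a link. Can this kind of analysis be automated?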

1 answer:

Answer 0 (score: 0)

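You can always load the datasets into DataFrames and check them that way. Depending on the size of the data, this can be slow, but it is a very basic approach. The code below creates extra DataFrames for learning purposes; it is not the best code, but I hope it lets you follow the steps.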

import pandas as pd
import numpy as np

# Two sample DataFrames; column C holds the strings to compare.
# np.nan is used because the np.NaN alias was removed in NumPy 2.0.
df = pd.DataFrame({'A': [np.nan, np.nan, 3, 4, 5, 5, 3, 1, 5, np.nan],
                   'B': [1, 0, 3, 5, 0, 0, np.nan, 9, 0, 0],
                   'C': ['Pharmacy of IDAHO', 'Access medicare arkansas', 'NJ Pharmacy', 'Idaho Rx', 'CA Herbals', 'Florida Pharma', 'AK RX', 'Ohio Drugs', 'PA Rx', 'USA Pharma'],
                   'D': [123456, 123456, 1234567, 12345678, 12345, 12345, 12345678, 123456789, 1234567, np.nan],
                   'E': ['Assign', 'Unassign', 'Assign', 'Ugly', 'Appreciate', 'Undo', 'Assign', 'Unicycle', 'Assign', 'Unicorn']})
df2 = pd.DataFrame({'A': [np.nan, np.nan, 3, 4, 5, 5, 3, 1, 5, np.nan],
                    'B': [1, 0, 3, 5, 0, 0, np.nan, 9, 0, 0],
                    'C': ['Pharmacy of IDAHO', 'Arkansas', 'NJ Pharmacy', 'Idaho Rockies?', 'CA Herbals', 'blah blah', 'AK RX', 'test_test', 'PA Rx', 'USA4Lyfe'],
                    'D': [123456, 123456, 1234567, 12345678, 12345, 12345, 12345678, 123456789, 1234567, np.nan]})
# Flag the rows where column C holds the same value in both DataFrames.
# The comparison is row by row, so both DataFrames must share the same index.
df2['Values'] = df['C'] == df2['C']
# Keep only the rows where the values match
df3 = df2[df2['Values']]
# The length of the filtered DataFrame is the number of matching rows
print("There are", len(df3), "Occurrences")

Output: There are 5 Occurrences
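
The comparison above only counts rows that match at the same position. For the original question (which string columns in two datasets share many values, regardless of row order), one possible sketch is to compare the sets of distinct values for every pair of columns. This is not part of the answer above: the helper names `string_values` and `column_overlap`, the sample frames `customers` and `orders`, and the choice of the Jaccard index as the score are all assumptions made for illustration.

```python
import pandas as pd
from itertools import product


def string_values(series):
    """Return the set of distinct values if the column holds only strings, else None."""
    values = series.dropna()
    if len(values) and values.map(lambda v: isinstance(v, str)).all():
        return set(values)
    return None


def column_overlap(left, right):
    """Score every pair of string columns by how many distinct values they share."""
    # Collect the distinct values of each string-only column (hypothetical helper above)
    left_sets = {c: string_values(left[c]) for c in left.columns}
    right_sets = {c: string_values(right[c]) for c in right.columns}
    rows = []
    for lcol, rcol in product(left.columns, right.columns):
        lvals, rvals = left_sets[lcol], right_sets[rcol]
        if lvals is None or rvals is None:
            continue  # skip non-string columns
        shared = lvals & rvals
        rows.append({
            "left": lcol,
            "right": rcol,
            "shared": len(shared),
            # Jaccard index: shared distinct values / all distinct values
            "jaccard": len(shared) / len(lvals | rvals),
        })
    return pd.DataFrame(rows).sort_values("jaccard", ascending=False, ignore_index=True)


# Made-up sample data, for illustration only
customers = pd.DataFrame({"name": ["Ann", "Bob", "Cid"],
                          "city": ["Oslo", "Rome", "Oslo"],
                          "age": [31, 45, 27]})
orders = pd.DataFrame({"buyer": ["Bob", "Cid", "Dan", "Bob"],
                       "town": ["Rome", "Lima", "Rome", "Lima"]})

result = column_overlap(customers, orders)
print(result)
```

Column pairs with the highest score are the most likely link candidates. Building a set per column keeps each comparison cheap, but the number of pairs grows with the product of the column counts, so on wide or very large datasets this can still be slow.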