Comparing two DataFrames

Time: 2017-10-31 23:11:26

Tags: python pandas numpy

This is a follow-up to a question I asked here: compare two pandas dataframes with unequal columns

See also: How to implement 'in' and 'not in' for Pandas dataframe

I have created two pandas DataFrames:

DataFrame: words

                  0
0           limited
1         desirable
2           advices

DataFrame: mcDonaldWL

            Word       Negative   Positive  Uncertainty
9            abandon     2009        0           0
10         abandoned     2009        0           0
11        desirables        0        2009        0
12       abandonment     2009        0           0
13           advices     2009        0           0
14          abandons     2009        0           0

My goal is to compare words[0] with mcDonaldWL['Word'] and, whenever the i-th element appears there, show the matching row.

Result 
              Word       Negative   Positive  Uncertainty
 11        desirables        0       2009        0
 13           advices     2009        0           0

I tried using set, intersection, and merge, but could not find a solution. Any ideas?

It does not produce the desired answer. This is not a duplicate.

If I run

words[~words.word.isin(mcDonaldWL)] 

I get:

    word
0   limited
1   desirable

4 answers:

Answer 0 (score: 1)

Suppose you have:

>>> df1
         col1
0     limited
1  desirables
2     advices
>>> df2
           Word  Negative  Positive  Uncertainty
9       abandon      2009         0            0
10    abandoned      2009         0            0
11   desirables         0      2009            0
12  abandonment      2009         0            0
13      advices      2009         0            0
14     abandons      2009         0            0

Note that I have given your first DataFrame a proper column label. In any case, the simplest approach is to use Word as the index:

>>> df2.set_index('Word', inplace=True)
>>> df2
             Negative  Positive  Uncertainty
Word
abandon          2009         0            0
abandoned        2009         0            0
desirables          0      2009            0
abandonment      2009         0            0
advices          2009         0            0
abandons         2009         0            0

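Then you can look the words up by index label. Note that on recent pandas versions .loc raises a KeyError if any requested label is missing (limited, here), whereas reindex returns NaN rows for missing labels:

>>> df2.reindex(df1.col1.values)
            Negative  Positive  Uncertainty
Word
limited          NaN       NaN          NaN
desirables       0.0    2009.0          0.0
advices       2009.0       0.0          0.0
>>> df2.reindex(df1.col1.values).dropna()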
            Negative  Positive  Uncertainty
Word
desirables       0.0    2009.0          0.0
advices       2009.0       0.0          0.0
>>>
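
As a variation on the index lookup above, here is a minimal sketch that filters with a boolean mask instead of a label lookup. It rebuilds df1 and df2 from the tables shown in this answer. Because no NaN rows are ever introduced, the integer columns should stay integers rather than being converted to floats.

```python
import pandas as pd

# Data rebuilt from the tables shown in this answer.
df1 = pd.DataFrame({"col1": ["limited", "desirables", "advices"]})
df2 = pd.DataFrame(
    {
        "Word": ["abandon", "abandoned", "desirables",
                 "abandonment", "advices", "abandons"],
        "Negative": [2009, 2009, 0, 2009, 2009, 2009],
        "Positive": [0, 0, 2009, 0, 0, 0],
        "Uncertainty": [0, 0, 0, 0, 0, 0],
    }
).set_index("Word")

# Boolean mask over the index: True where the word also appears in df1.col1.
matched = df2[df2.index.isin(df1["col1"])]
print(matched)
```

The rows come back in the order they have in df2, not in the order of df1, which may or may not matter for your use case.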

Answer 1 (score: 1)

Use fuzzy matching

import numpy as np
from fuzzywuzzy import process

# Candidate words to match against, as a plain list.
l = words.iloc[:, 0].values.tolist()

a = []
for x in mcDonaldWL.Word:
    # Best candidate for x, returned as a (match, score) tuple.
    match, score = process.extract(x, l, limit=1)[0]
    a.append(match if score >= 80 else np.nan)

mcDonaldWL['canfind'] = a
mcDonaldWL.dropna().drop(columns='canfind')


Out[494]: 
          Word  Negative  Positive  Uncertainty
11  desirables         0      2009            0
13     advices      2009         0            0
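
If installing fuzzywuzzy is not an option, the same idea can be sketched with difflib from the standard library. This is a substitute technique, not the fuzzywuzzy scorer: difflib's similarity ratio is computed differently, and the 0.8 cutoff and the helper name has_close_match are assumptions chosen for illustration.

```python
import difflib

import pandas as pd

# Data rebuilt from the tables in the question.
words = pd.DataFrame({0: ["limited", "desirable", "advices"]})
mcDonaldWL = pd.DataFrame(
    {
        "Word": ["abandon", "abandoned", "desirables",
                 "abandonment", "advices", "abandons"],
        "Negative": [2009, 2009, 0, 2009, 2009, 2009],
        "Positive": [0, 0, 2009, 0, 0, 0],
        "Uncertainty": [0, 0, 0, 0, 0, 0],
    },
    index=range(9, 15),
)

candidates = words[0].tolist()


def has_close_match(word, cutoff=0.8):
    """Return True if any candidate is at least `cutoff` similar to `word`."""
    return bool(difflib.get_close_matches(word, candidates, n=1, cutoff=cutoff))


# Keep only the rows whose Word has a close match among the candidates.
result = mcDonaldWL[mcDonaldWL["Word"].apply(has_close_match)]
print(result)
```

This avoids the temporary canfind column, since the mask is applied directly.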

Answer 2 (score: 1)

Method 1

import numpy as np

ws = words.values.ravel().astype(str)
wl = mcDonaldWL.Word.values.astype(str)

# find() returns -1 when the substring is absent; keep rows where any word is found.
mcDonaldWL[(np.char.find(wl[:, None], ws) >= 0).any(axis=1)]

          Word  Negative  Positive  Uncertainty
11  desirables         0      2009            0
13     advices      2009         0            0

Method 2

mcDonaldWL[mcDonaldWL.Word.str.contains('|'.join(words.values.ravel()))]

          Word  Negative  Positive  Uncertainty
11  desirables         0      2009            0
13     advices      2009         0            0
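
One caveat with Method 2 is that the joined string is treated as a regular expression, so a word containing characters such as . or + would change the pattern. A minimal sketch of the same approach with each word escaped first, using data rebuilt from the question:

```python
import re

import pandas as pd

# Data rebuilt from the tables in the question.
words = pd.DataFrame({0: ["limited", "desirable", "advices"]})
mcDonaldWL = pd.DataFrame(
    {
        "Word": ["abandon", "abandoned", "desirables",
                 "abandonment", "advices", "abandons"],
        "Negative": [2009, 2009, 0, 2009, 2009, 2009],
        "Positive": [0, 0, 2009, 0, 0, 0],
        "Uncertainty": [0, 0, 0, 0, 0, 0],
    },
    index=range(9, 15),
)

# Escape each word so it is matched literally, then join into one alternation.
pattern = "|".join(re.escape(w) for w in words[0])

# Substring match: "desirables" is kept because it contains "desirable".
result = mcDonaldWL[mcDonaldWL["Word"].str.contains(pattern, regex=True)]
print(result)
```

Both methods in this answer match substrings, so a short entry in words would also match every longer word that contains it.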

Answer 3 (score: 0)

In words you have "desirable", but in mcDonaldWL you have "desirables". Assuming these are meant to be the same, you can do:

mcDonaldWL.set_index('Word', inplace=True)
mcDonaldWL.loc[words[0]]
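
On recent pandas versions a .loc lookup like the one above raises a KeyError when a label such as "limited" is absent from the index. One way around that, sketched here under the assumption that "desirable" in words has already been corrected to "desirables", is an inner merge, which keeps only the words present in both frames:

```python
import pandas as pd

# Assumption: "desirable" has been corrected to "desirables" in words.
words = pd.DataFrame({0: ["limited", "desirables", "advices"]})
mcDonaldWL = pd.DataFrame(
    {
        "Word": ["abandon", "abandoned", "desirables",
                 "abandonment", "advices", "abandons"],
        "Negative": [2009, 2009, 0, 2009, 2009, 2009],
        "Positive": [0, 0, 2009, 0, 0, 0],
        "Uncertainty": [0, 0, 0, 0, 0, 0],
    },
    index=range(9, 15),
)

# Give the words column the same name as the key column, then inner-join on it.
result = mcDonaldWL.merge(words.rename(columns={0: "Word"}), on="Word", how="inner")
print(result)
```

Note that merge discards the original row labels (11 and 13) and returns a fresh 0-based index.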

Also, "advices" is not a word.