Python / Pandas: creating a new dataframe, getting the error "Unalignable boolean Series provided as indexer"

Time: 2017-10-13 18:16:56

Tags: python dataframe series

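I am trying to compare two dataframes and return different result sets depending on whether a value from one dataframe is present in the other.

Here is my sample code: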

import pandas as pd

pmdf = pd.DataFrame(
        {
        'Journal' : ['US Drug standards.','Acta veterinariae.','Bulletin of big toe science.','The UK journal of dermatology.','Journal of Hypothetical Journals'],
        'ISSN': ['0096-0225', '0567-8315','0007-4977','0007-0963','8675-309J'],
        }
        )

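# Put 'Journal' first. Naming both columns explicitly avoids dropping 'ISSN'
# on pandas versions that keep dict insertion order for the columns.
pmdf = pmdf[['Journal', 'ISSN']]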

jcrdf = pd.DataFrame(
        {
        'Full Journal Title': ['Drug standards.','Acta veterinaria.','Bulletin of marine science.','The British journal of dermatology.'],
        'Abbreviated Title': ['DStan','Avet','Marsci','BritSkin'],
        'Total Cites': ['223','444','324','166'],
        'ISSN': ['0096-0225','0567-8315','0007-4977','0007-0963'],   
        'All_ISSNs': ['0096-0225,0096-0225','0567-8315,1820-7448,0567-8315','0007-4977,0007-4977','0007-0963,0007-0963,0366-077X,1365-2133']                        
         })
jcrdf = jcrdf.set_index('Full Journal Title')

pmdf_issn = pmdf['ISSN'].values.tolist()

This line gets the rows from dataframe jcrdf that contain an ISSN from dataframe pmdf:

pmjcrmatch = jcrdf[jcrdf['All_ISSNs'].str.contains('|'.join(pmdf_issn))]

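I want the following line to create a new dataframe from pmdf containing the rows whose ISSN is not in jcrdf, so I negated the previous statement and selected from the first dataframe: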

pmjcrnomatch = pmdf[~jcrdf['All_ISSNs'].str.contains('|'.join(pmdf_issn))]

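I get an error: "Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match)"

I haven't found much information on this particular error, at least nothing that helps me work out a solution.

Is "str.contains" not the best way of separating the items that are in the second dataframe from those that are not?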

1 answer:

Answer 0: (score: 1)

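You are trying to apply a boolean index built from one dataframe to a different dataframe. That only works when the index of the boolean Series aligns with the index of the dataframe being filtered. Here the mask comes from jcrdf, which is indexed by journal title, while pmdf has a default integer index (and a different number of rows). In your case you should use isin: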

# get all rows from jcrdf where `All_ISSNs` contains any of the `ISSN` values in `pmdf`.
pmjcrmatch = jcrdf[jcrdf.All_ISSNs.str.contains('|'.join(pmdf.ISSN))]
# assign all remaining rows from `jcrdf` to a new dataframe.
pmjcrnomatch = jcrdf[~jcrdf.ISSN.isin(pmjcrmatch.ISSN)]
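
To see why the original line fails, it can help to compare the index of the mask with the index of the dataframe being filtered. The following is only a minimal sketch on trimmed-down versions of the two dataframes. The names `mask`, `all_issns` and `in_jcr` are illustrative and do not come from the question.

```python
import pandas as pd

pmdf = pd.DataFrame({
    'Journal': ['US Drug standards.', 'Journal of Hypothetical Journals'],
    'ISSN': ['0096-0225', '8675-309J'],
})
jcrdf = pd.DataFrame({
    'Full Journal Title': ['Drug standards.', 'Acta veterinaria.'],
    'ISSN': ['0096-0225', '0567-8315'],
    'All_ISSNs': ['0096-0225,0096-0225', '0567-8315,1820-7448,0567-8315'],
}).set_index('Full Journal Title')

# The mask is built from jcrdf, so it carries jcrdf's index (journal titles).
mask = jcrdf['All_ISSNs'].str.contains('|'.join(pmdf['ISSN']))
print(mask.index.equals(jcrdf.index))  # the mask can filter jcrdf
print(mask.index.equals(pmdf.index))   # but it does not line up with pmdf

# A mask built from pmdf itself aligns with pmdf by construction.
all_issns = set(','.join(jcrdf['All_ISSNs']).split(','))
in_jcr = pmdf['ISSN'].isin(all_issns)
pmjcrnomatch = pmdf[~in_jcr]
print(pmjcrnomatch)
```

Because `in_jcr` is derived from `pmdf['ISSN']`, it shares pmdf's index and can be negated and used to filter pmdf directly.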

Edit: let's try another approach.

First I would create a lookup of all your ISSNs, and then create the diff by isolating the matches:

import pandas as pd

pmdf = pd.DataFrame(
        {
        'Journal' : ['US Drug standards.','Acta veterinariae.','Bulletin of big toe science.','The UK journal of dermatology.','Journal of Hypothetical Journals'],
        'ISSN': ['0096-0225', '0567-8315','0007-4977','0007-0963','8675-309J'],
        }
        )

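# Put 'Journal' first. Naming both columns explicitly avoids dropping 'ISSN'
# on pandas versions that keep dict insertion order for the columns.
pmdf = pmdf[['Journal', 'ISSN']]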

jcrdf = pd.DataFrame(
        {
        'Full Journal Title': ['Drug standards.','Acta veterinaria.','Bulletin of marine science.','The British journal of dermatology.'],
        'Abbreviated Title': ['DStan','Avet','Marsci','BritSkin'],
        'Total Cites': ['223','444','324','166'],
        'ISSN': ['0096-0225','0567-8315','0007-4977','0007-0963'],
        'All_ISSNs': ['0096-0225,0096-0225','0567-8315,1820-7448,0567-8315','0007-4977,0007-4977','0007-0963,0007-0963,0366-077X,1365-2133']
         })
jcrdf = jcrdf.set_index('Full Journal Title')

# create a lookup of all ISSNs to avoid expensive string matching
jcrdf_lookup = pd.DataFrame(jcrdf['All_ISSNs'].str.split(',').tolist(),
                            index=jcrdf.ISSN).stack(level=0).reset_index(level=0)

# compare the ISSNs extracted from All_ISSNs with pmdf.ISSN
matches = jcrdf_lookup[jcrdf_lookup[0].isin(pmdf.ISSN)]
jcrdfmatch = jcrdf[jcrdf.ISSN.isin(matches.ISSN)]
jcrdfnomatch = pmdf[~pmdf.ISSN.isin(matches[0])]
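
On pandas 0.25 or later, `Series.explode` may be a more direct way to build the same kind of lookup. This is a sketch under that version assumption, not part of the original answer. The names `lookup`, `Any_ISSN`, `matched_jcr` and `unmatched_pm` are made up for illustration.

```python
import pandas as pd

pmdf = pd.DataFrame({
    'Journal': ['US Drug standards.', 'Acta veterinariae.',
                'Bulletin of big toe science.',
                'The UK journal of dermatology.',
                'Journal of Hypothetical Journals'],
    'ISSN': ['0096-0225', '0567-8315', '0007-4977', '0007-0963', '8675-309J'],
})
jcrdf = pd.DataFrame({
    'Full Journal Title': ['Drug standards.', 'Acta veterinaria.',
                           'Bulletin of marine science.',
                           'The British journal of dermatology.'],
    'ISSN': ['0096-0225', '0567-8315', '0007-4977', '0007-0963'],
    'All_ISSNs': ['0096-0225,0096-0225',
                  '0567-8315,1820-7448,0567-8315',
                  '0007-4977,0007-4977',
                  '0007-0963,0007-0963,0366-077X,1365-2133'],
}).set_index('Full Journal Title')

# One row per (primary ISSN, any listed ISSN) pair, duplicates removed.
lookup = (
    jcrdf.set_index('ISSN')['All_ISSNs']
    .str.split(',')
    .explode()
    .rename('Any_ISSN')
    .reset_index()
    .drop_duplicates()
)

# Rows of jcrdf that have at least one ISSN appearing in pmdf.
matches = lookup[lookup['Any_ISSN'].isin(pmdf['ISSN'])]
matched_jcr = jcrdf[jcrdf['ISSN'].isin(matches['ISSN'])]

# Rows of pmdf whose ISSN appears nowhere in jcrdf's All_ISSNs.
unmatched_pm = pmdf[~pmdf['ISSN'].isin(lookup['Any_ISSN'])]
print(unmatched_pm)
```

Each mask here is built from the same dataframe it filters, so the alignment error from the question does not arise.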