Creating new DataFrames from partial string matches

Date: 2018-02-27 18:17:18

Tags: python pandas dataframe

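I have a relatively simple DataFrame (shown below). One of its columns, "Book", holds a comma-separated list of strings.

My goal is to create a new DataFrame for each of the three distinct values in "Book": one with every product that appears in International, one with every product that appears in Domestic, and one for Subscription.

I don't know how to build a new DataFrame by matching partial strings in an existing one. Is there built-in functionality for this, or should I write a loop that iterates over the DataFrame and builds a new one?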

DF

    Description      Book                               Product ID
0   Products      International, Domestic                 X11
1   Products      International                           X12
2   Products      Domestic                                X13
3   Products      Domestic, International                 X21
4   Services      Subscription, Domestic                  X23
5   Services      International, Domestic                 X23
6   Services      Subscription, International, Domestic   X25

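I have tried various combinations of pandas' isin, but that requires knowing the exact string you are looking for. In my case the Book column can contain the three values in any order, so I could not get isin to work.

An example of a loop I tried: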

f = []
for index,row in df.iterrows():
    if "International" in row['Book']:
        f.append 

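However, this produces an empty list, which I know isn't right. I'm not very strong at building loops over DataFrames, so any suggestions are greatly appreciated.

My target output is a set of DataFrames like the following: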

DF

    Description      Book                               Product ID
0   Products      International                           X11
1   Products      International                           X12
2   Products      International                           X21
3   Services      International                           X23
4   Services      International                           X25

and

DF

    Description   Book                               Product ID
0   Products      Domestic                                X11
2   Products      Domestic                                X13
3   Products      Domestic                                X21
4   Services      Domestic                                X23
5   Services      Domestic                                X25

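The same goes for Subscription. I have looked at several other SO questions and could not find one that helps with this situation.

3 Answers:

Answer 0 (score: 1)

I'm not sure the code you tried ever had a chance of working. Have you tried the following: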

f = []
for index,row in df.iterrows():
    if "International" in row['Book']:
        f.append(row)

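Note the f.append(row) at the end: append has to be called with the row as its argument.

This may not be the best way.

I would try something like the following instead. It gives you three boolean columns that are better suited to grouping (df.groupby), which in turn gives you the list of products in each category.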
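The loop only collects rows in a list, so one more step is needed to get a DataFrame back. A minimal sketch, assuming the sample data from the question (rebuilt by hand here) and using the name df_international purely for illustration:

```python
import pandas as pd

# Sample data re-typed from the question.
df = pd.DataFrame({
    "Description": ["Products", "Products", "Products", "Products",
                    "Services", "Services", "Services"],
    "Book": ["International, Domestic", "International", "Domestic",
             "Domestic, International", "Subscription, Domestic",
             "International, Domestic",
             "Subscription, International, Domestic"],
    "Product ID": ["X11", "X12", "X13", "X21", "X23", "X23", "X25"],
})

f = []
for index, row in df.iterrows():
    # Substring test on the comma-separated Book string.
    if "International" in row["Book"]:
        f.append(row)

# Each collected row is a Series; pd.DataFrame stacks them back into rows.
df_international = pd.DataFrame(f).reset_index(drop=True)
# Overwrite Book so it shows only the category that was matched.
df_international["Book"] = "International"
print(df_international)
```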

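# axis=1 passes each row to the lambda; the default (axis=0) passes whole columns
df['International'] = df.apply(lambda r: 'International' in r['Book'], axis=1)
df['Domestic'] = df.apply(lambda r: 'Domestic' in r['Book'], axis=1)
df['Subscription'] = df.apply(lambda r: 'Subscription' in r['Book'], axis=1)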
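The same flag columns can probably be built without a row-wise apply by using the vectorized Series.str.contains. The sketch below assumes the question's sample data (re-typed here); the names counts and domestic_ids are only illustrative:

```python
import pandas as pd

# Sample data re-typed from the question.
df = pd.DataFrame({
    "Description": ["Products", "Products", "Products", "Products",
                    "Services", "Services", "Services"],
    "Book": ["International, Domestic", "International", "Domestic",
             "Domestic, International", "Subscription, Domestic",
             "International, Domestic",
             "Subscription, International, Domestic"],
    "Product ID": ["X11", "X12", "X13", "X21", "X23", "X23", "X25"],
})

categories = ["International", "Domestic", "Subscription"]
for category in categories:
    # regex=False treats the category name as a literal substring.
    df[category] = df["Book"].str.contains(category, regex=False)

# Summing boolean columns counts the rows that belong to each category.
counts = df[categories].sum()
# A boolean column can be used directly as a row filter.
domestic_ids = df.loc[df["Domestic"], "Product ID"].tolist()
print(counts)
print(domestic_ids)
```

Keeping a single DataFrame with flag columns avoids maintaining three copies of the data, at the cost of filtering each time a single category is needed.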

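Answer 1 (score: 1)

As I commented, use get_dummies

# If the values are separated by ", " (comma plus space), as displayed in the question, pass sep=', ' instead
s=df.Book.str.get_dummies(sep=',')
# One DataFrame per dummy column, with Book overwritten by that category name
[df[s[x]==1].assign(Book=x) for x in s.columns]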
Out[198]: 
[  Description      Book ProductID
 0    Products  Domestic       X11
 2    Products  Domestic       X13
 3    Products  Domestic       X21
 4    Services  Domestic       X23
 5    Services  Domestic       X23
 6    Services  Domestic       X25,   Description           Book ProductID
 0    Products  International       X11
 1    Products  International       X12
 3    Products  International       X21
 5    Services  International       X23
 6    Services  International       X25,   Description          Book ProductID
 4    Services  Subscription       X23
 6    Services  Subscription       X25]
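If the frames need to be looked up by name rather than by list position, the same idea can be written as a dict. This is a sketch under the assumption that the Book values look exactly as displayed in the question (comma followed by a space); the separator is normalized first so that get_dummies does not produce columns with leading spaces. The names dummies and frames are illustrative:

```python
import pandas as pd

# Sample data re-typed from the question.
df = pd.DataFrame({
    "Description": ["Products", "Products", "Products", "Products",
                    "Services", "Services", "Services"],
    "Book": ["International, Domestic", "International", "Domestic",
             "Domestic, International", "Subscription, Domestic",
             "International, Domestic",
             "Subscription, International, Domestic"],
    "Product ID": ["X11", "X12", "X13", "X21", "X23", "X23", "X25"],
})

# Collapse optional whitespace around commas, then one-hot encode the values.
normalized = df["Book"].str.replace(r"\s*,\s*", ",", regex=True)
dummies = normalized.str.get_dummies(sep=",")

# One DataFrame per category, keyed by the category name.
frames = {
    name: df[dummies[name] == 1].assign(Book=name).reset_index(drop=True)
    for name in dummies.columns
}
print(frames["Subscription"])
```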

Answer 2 (score: 1)

Another way:

International:

df_international = df[df['Book'].str.contains('International')].reset_index(drop=True)
df_international.loc[:, 'Book'] = 'International'
print(df_international)
#      Description           Book Product ID
#0        Products  International        X11
#1        Products  International        X12
#2        Products  International        X21
#3        Services  International        X23
#4        Services  International        X25

Domestic:

df_domestic = df[df['Book'].str.contains('Domestic')].reset_index(drop=True)
df_domestic.loc[:, 'Book'] = 'Domestic'
print(df_domestic)
#      Description      Book Product ID
#0        Products  Domestic        X11
#1        Products  Domestic        X13
#2        Products  Domestic        X21
#3        Services  Domestic        X23
#4        Services  Domestic        X23
#5        Services  Domestic        X25

Subscription:

df_subscription = df[df['Book'].str.contains('Subscription')].reset_index(drop=True)
df_subscription.loc[:, 'Book'] = 'Subscription'
print(df_subscription)
#      Description          Book Product ID
#0        Services  Subscription        X23
#1        Services  Subscription        X25
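The three blocks above repeat the same steps once per category. A different technique, not taken from any of the answers, is to split the Book string and use DataFrame.explode (available in pandas 0.25 and later) so that each product/category pair becomes its own row. A sketch assuming the question's sample data; exploded and frames are illustrative names:

```python
import pandas as pd

# Sample data re-typed from the question.
df = pd.DataFrame({
    "Description": ["Products", "Products", "Products", "Products",
                    "Services", "Services", "Services"],
    "Book": ["International, Domestic", "International", "Domestic",
             "Domestic, International", "Subscription, Domestic",
             "International, Domestic",
             "Subscription, International, Domestic"],
    "Product ID": ["X11", "X12", "X13", "X21", "X23", "X23", "X25"],
})

# Turn each Book string into a list, then give every list element its own row.
exploded = (
    df.assign(Book=df["Book"].str.split(","))
      .explode("Book")
      .reset_index(drop=True)
)
# Remove the space left over after each comma.
exploded["Book"] = exploded["Book"].str.strip()

# One DataFrame per distinct Book value.
frames = {
    name: group.reset_index(drop=True)
    for name, group in exploded.groupby("Book")
}
print(frames["International"])
```

Unlike a substring test, this matches whole values only, which may matter if one category name is ever contained in another.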