Pandas: how do I search a column for a set of strings?

Time: 2018-09-07 14:33:45

Tags: string pandas dataframe

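I have a DataFrame with a column of tweets. The tweet texts contain so-called "@" mentions. I want to add a new column to this DataFrame that holds the specific "@" mention found in each row. Code: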

import numpy as np

dfEx5['text'] = dfEx5['text'].astype(str)  # convert every element of the text column to str

# Nested np.where: the first handle that matches wins; rows with no match get None
dfEx5['mentions'] = np.where(dfEx5.text.str.contains("@AmericanAir"), "@AmericanAir",
                    np.where(dfEx5.text.str.contains("@JetBlue"), "@JetBlue",
                    np.where(dfEx5.text.str.contains("@SouthwestAir"), "@SouthwestAir",
                    np.where(dfEx5.text.str.contains("@united"), "@united",
                    np.where(dfEx5.text.str.contains("@USAirways"), "@USAirways",
                    np.where(dfEx5.text.str.contains("@VirginAmerica"), "@VirginAmerica", None))))))

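First, I convert all elements to the string type. Then, if the text in a row contains "@AmericanAir", "@AmericanAir" goes into the mentions column, and so on for the other handles.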

Thanks for your help!

1 Answer:

Answer 0: (score: 0)

pandas.Series.str.findall

I would find every mention that appears in my watch list and take the first one.

df.text.str.findall('|'.join(watch)).str[0]

0      @AmericanAir
1          @JetBlue
2     @SouthwestAir
3           @united
4        @USAirways
5    @VirginAmerica
Name: text, dtype: object
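As a related sketch (not part of the original answer): when only the first mention is needed, `Series.str.extract` with a single capture group returns it directly, and rows without any match come back as NaN. The sample rows and the use of `re.escape` are assumptions added here for illustration; escaping only matters if a handle ever contains a regex metacharacter.

```python
import re
import pandas as pd

watch = ['@SouthwestAir', '@VirginAmerica', '@united',
         '@JetBlue', '@USAirways', '@AmericanAir']

# Hypothetical sample data, including one row with no watched mention.
df = pd.DataFrame({'text': ['@AmericanAir @JetBlue',
                            '@united @SouthwestAir',
                            'no mention here']})

# One alternation pattern wrapped in a capture group.
pattern = '(' + '|'.join(re.escape(w) for w in watch) + ')'

# expand=False returns a Series with the first match per row (NaN if none).
first_mention = df['text'].str.extract(pattern, expand=False)
print(first_mention)
```

Like `findall(...).str[0]`, this picks the mention that occurs earliest in the text, not the one listed first in `watch`.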

Add it as a new column via assign:

df.assign(mentions=df.text.str.findall('|'.join(watch)).str[0])

                    text        mentions
0  @AmericanAir @JetBlue    @AmericanAir
1               @JetBlue        @JetBlue
2          @SouthwestAir   @SouthwestAir
3  @united @SouthwestAir         @united
4             @USAirways      @USAirways
5         @VirginAmerica  @VirginAmerica
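Note that `assign` returns a new DataFrame and leaves the original untouched, so the result has to be bound to a name to be kept. A minimal sketch with made-up rows (the variable name `result` is an assumption for illustration):

```python
import pandas as pd

watch = ['@JetBlue', '@united', '@SouthwestAir']
df = pd.DataFrame({'text': ['@JetBlue', '@united @SouthwestAir']})

# assign builds a copy with the extra column; df itself is not modified.
result = df.assign(mentions=df['text'].str.findall('|'.join(watch)).str[0])

print(list(df.columns))      # original columns
print(list(result.columns))  # original columns plus 'mentions'
```

Writing `df['mentions'] = ...` instead would modify `df` in place; which one to use is a matter of preference.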

If you prefer, you can keep all of the mentions:

df.assign(mentions=df.text.str.findall('|'.join(watch)))

                    text                  mentions
0  @AmericanAir @JetBlue  [@AmericanAir, @JetBlue]
1               @JetBlue                [@JetBlue]
2          @SouthwestAir           [@SouthwestAir]
3  @united @SouthwestAir  [@united, @SouthwestAir]
4             @USAirways              [@USAirways]
5         @VirginAmerica          [@VirginAmerica]
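If the list-valued column is awkward to work with, one possible follow-up (an addition here, assuming pandas 0.25 or later for `DataFrame.explode`) is to expand it to one row per mention, which makes counting straightforward. The sample rows are hypothetical.

```python
import pandas as pd

watch = ['@united', '@SouthwestAir', '@JetBlue']
df = pd.DataFrame({'text': ['@united @SouthwestAir', '@JetBlue', '@united']})

# Keep every watched mention per row as a list.
all_mentions = df.assign(mentions=df['text'].str.findall('|'.join(watch)))

# One row per (tweet, mention) pair; the original index values are repeated.
long_form = all_mentions.explode('mentions')

# How often each handle is mentioned across all tweets.
counts = long_form['mentions'].value_counts()
print(counts)
```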

Setup

import pandas as pd

watch = [
    '@SouthwestAir',
    '@VirginAmerica',
    '@united',
    '@JetBlue',
    '@USAirways',
    '@AmericanAir'
]
text = """\
@AmericanAir @JetBlue
@JetBlue
@SouthwestAir
@united @SouthwestAir
@USAirways
@VirginAmerica
"""
df = pd.DataFrame(dict(text=text.splitlines()))

df

                    text
0  @AmericanAir @JetBlue
1               @JetBlue
2          @SouthwestAir
3  @united @SouthwestAir
4             @USAirways
5         @VirginAmerica