How do I select and replace the main keyword in a pandas column of strings?

Time: 2019-02-25 06:51:10

Tags: pandas dataframe string-substitution

Here is my data:

Id  Keyword
1   ayam e-commerce
2   biaya fuel personal wallet
3   pulsa sms virtualaccount
4   biaya koperasi personal
5   familymart personal
6   e-commerce pln
7   biaya onus
8   koperasi personal
9   biaya familymart personal
10  fuel personal wallet
11  fuel travel

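I want every entry that contains one of the main keywords fuel, pln, or ayam to be shortened to just fuel, pln, or ayam, so that, for example, the first row becomes

ayam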

How should I do this?

1 answer:

Answer 0: (score: 1)

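To replace each entry with a single matched keyword, use str.contains in a loop (if an entry contains several keywords, the one listed first in L wins):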

L = ['fuel', 'pln', 'ayam']
for x in L:
    df.loc[df['Keyword'].str.contains(x), 'Keyword'] = x
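
For anyone who wants to run the loop end to end, here is a minimal self-contained sketch. The DataFrame is rebuilt from the sample data in the question; passing regex=False is my own assumption (the snippet above uses the default regex matching), added so that each keyword is treated as a literal substring.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    'Id': range(1, 12),
    'Keyword': [
        'ayam e-commerce', 'biaya fuel personal wallet', 'pulsa sms virtualaccount',
        'biaya koperasi personal', 'familymart personal', 'e-commerce pln',
        'biaya onus', 'koperasi personal', 'biaya familymart personal',
        'fuel personal wallet', 'fuel travel',
    ],
})

L = ['fuel', 'pln', 'ayam']
for x in L:
    # Assumption: regex=False matches x as a plain substring, not as a regex
    mask = df['Keyword'].str.contains(x, regex=False)
    # Overwrite every matching row with the keyword itself
    df.loc[mask, 'Keyword'] = x

result = df['Keyword'].tolist()
print(result)
```

Note that this is substring matching, so a keyword would also match inside a longer word; the regex version with \b avoids that.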

Or use a nested list comprehension:

L = ['fuel', 'pln', 'ayam']
df['Keyword'] = [next((z for z in L if z in x), x) for x in df['Keyword']]
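
To make the precedence in the comprehension explicit, here is a small sketch of the same lookup in plain Python. The helper name first_keyword and the first test string are hypothetical, made up for illustration; they are not part of the question's data.

```python
L = ['fuel', 'pln', 'ayam']

def first_keyword(text, keywords):
    """Return the first keyword (in list order) found in text, else text unchanged."""
    return next((k for k in keywords if k in text), text)

# Hypothetical entry containing both 'pln' and 'fuel': list order decides, not position
a = first_keyword('biaya pln fuel wallet', L)
# Entry with no keyword falls back to the original string
b = first_keyword('pulsa sms virtualaccount', L)
print(a, '|', b)
```

So 'fuel' is returned for the first string even though 'pln' appears earlier in it, because 'fuel' comes first in L.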

Or use str.extract with fillna, which replaces the missing values (rows with no match) with the original values:

L = ['fuel', 'pln', 'ayam']
pat = '|'.join(r"\b{}\b".format(x) for x in L)
df['Keyword'] = df['Keyword'].str.extract('('+ pat + ')', expand=False).fillna(df['Keyword'])


print (df)
    Id                    Keyword
0    1                       ayam
1    2                       fuel
2    3   pulsa sms virtualaccount
3    4    biaya koperasi personal
4    5        familymart personal
5    6                        pln
6    7                 biaya onus
7    8          koperasi personal
8    9  biaya familymart personal
9   10                       fuel
10  11                       fuel
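
The regex version behaves a little differently from plain substring checks, which the following sketch tries to show. The three strings are hypothetical examples, and wrapping each keyword in re.escape is my own addition (it only matters if a keyword contains regex metacharacters).

```python
import re
import pandas as pd

L = ['fuel', 'pln', 'ayam']
# \b...\b restricts matches to whole words; re.escape is an added safeguard
pat = '|'.join(r"\b{}\b".format(re.escape(x)) for x in L)

# Hypothetical rows: two keywords in one row, a keyword inside a longer word, no keyword
s = pd.Series(['biaya pln fuel wallet', 'refuel station', 'biaya onus'])

# extract returns the leftmost match in each string, NaN where nothing matches
out = s.str.extract('(' + pat + ')', expand=False).fillna(s)
print(out.tolist())
```

Here the first row yields 'pln' because it is the leftmost match in the string, and 'refuel station' is left unchanged because 'fuel' is not a whole word there.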

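If you need all matched values, use str.findall together with str.join, then use loc to overwrite only the rows where the joined result is non-empty (rows without a match keep their original value):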

print (df)
   Id                   Keyword
0   1           ayam e-commerce
1   2     biaya fuel pln wallet <- matched 2 keywords
2   3  pulsa sms virtualaccount

pat = '|'.join(r"\b{}\b".format(x) for x in L)
s = df['Keyword'].str.findall('('+ pat + ')').str.join(', ')
df.loc[s != '', 'Keyword'] = s
print (df)
   Id                   Keyword
0   1                      ayam
1   2                 fuel, pln
2   3  pulsa sms virtualaccount
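
One thing to keep in mind is that findall lists a keyword once per occurrence. If each keyword should appear only once, a possible variation (a sketch, not part of the answer above) is to de-duplicate before joining and to use where instead of loc. The second row below is hypothetical and repeats 'fuel' on purpose.

```python
import pandas as pd

L = ['fuel', 'pln', 'ayam']
pat = '|'.join(r"\b{}\b".format(x) for x in L)

# Hypothetical rows; the second one repeats 'fuel'
s = pd.Series(['ayam e-commerce', 'fuel pln fuel wallet', 'pulsa sms virtualaccount'])

found = s.str.findall(pat)
# dict.fromkeys drops repeated keywords while preserving first-seen order
joined = found.apply(lambda kws: ', '.join(dict.fromkeys(kws)))
# Keep the joined keywords where there was a match, otherwise the original text
out = joined.where(joined != '', s)
print(out.tolist())
```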