I have the following DataFrame (df):

   Comments                       ID
0        10         Looking for help
1        11  Look at him but be nice
2        12                  Be calm
3        13               Being good
4        14              Him and Her
5        15                  Himself

and a list of words that I need to find as exact (whole-word) matches:

word_list = ['look','be','him']

This is the output I want:

   Comments                       ID Word_01 Word_02 Word_03
0        10         Looking for help
1        11  Look at him but be nice    look      be     him
2        12                  Be calm      be
3        13               Being good
4        14              Him and Her     him
5        15                  Himself

I have tried things such as str.findall and a few others, but I can't seem to match my words exactly.
Any help with this would be much appreciated.
Thanks
Answer 0: (score: 1)
You can use the pandas apply function.
Example:
import pandas as pd

my_dataframe = pd.DataFrame({'Comments': [10, 11, 12, 13, 14, 15],
                             'ID': [
                                 'Looking for help',
                                 'Look at him but be nice',
                                 'Be calm',
                                 'Being good',
                                 'Him and Her',
                                 'Himself']
                             })
print(my_dataframe)

word_list = ['look','be','him']

for index, word in enumerate(word_list):
    def match_word(val):
        """
        Under-optimized pattern matching.
        Return the word if it occurs anywhere in val (case-insensitive), else None.
        """
        # Substring test: 'look' also matches inside 'Looking'.
        if word.lower() in val.lower():
            return word
        return None
    # One new column per search word: Word_0, Word_1, ...
    my_dataframe['Word_{}'.format(index)] = my_dataframe['ID'].apply(match_word)
print(my_dataframe)
Output:
   Comments                       ID
0        10         Looking for help
1        11  Look at him but be nice
2        12                  Be calm
3        13               Being good
4        14              Him and Her
5        15                  Himself
   Comments                       ID Word_0 Word_1 Word_2
0        10         Looking for help   look   None   None
1        11  Look at him but be nice   look     be    him
2        12                  Be calm   None     be   None
3        13               Being good   None     be   None
4        14              Him and Her   None   None    him
5        15                  Himself   None   None    him
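
The substring test in the code above also flags 'Looking', 'Being' and 'Himself', which the question wants excluded. A minimal sketch of a whole-word variant of the same apply approach follows. It assumes Python's re module with \b word boundaries; the helper name make_matcher is hypothetical, and returning '' instead of None is a choice made here to mirror the blank cells in the desired output.

```python
import re

import pandas as pd

df = pd.DataFrame({
    'Comments': [10, 11, 12, 13, 14, 15],
    'ID': ['Looking for help', 'Look at him but be nice', 'Be calm',
           'Being good', 'Him and Her', 'Himself'],
})
word_list = ['look', 'be', 'him']


def make_matcher(word):
    """Build a function that returns `word` on a whole-word, case-insensitive hit."""
    # re.escape is an added precaution in case a word contains regex metacharacters.
    pattern = re.compile(r'\b{}\b'.format(re.escape(word)), flags=re.I)

    def match_word(val):
        # '' (not None) so that non-matching cells print as blanks.
        return word if pattern.search(val) else ''

    return match_word


# One column per search word, in word_list order.
for index, word in enumerate(word_list):
    df['Word_{}'.format(index)] = df['ID'].apply(make_matcher(word))

print(df)
```

Because each word gets its own column, the result follows word_list order (look, be, him) rather than the order in which the words appear in the text.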
Answer 1: (score: 1)
Each word needs word boundaries. One possible solution uses Series.str.extractall, DataFrame.add_prefix and DataFrame.join to the original DataFrame:
import re

word_list = ['look','be','him']
pat = '|'.join(r"\b{}\b".format(x) for x in word_list)
df1 = df['ID'].str.extractall('(' + pat + ')', flags = re.I)[0].unstack().add_prefix('Word_')

df = df.join(df1).fillna('')
print (df)
   Comments                       ID Word_0 Word_1 Word_2
0        10         Looking for help
1        11  Look at him but be nice   Look    him     be
2        12                  Be calm     Be
3        13               Being good
4        14              Him and Her    Him
5        15                  Himself
For lowercase values in the output, add Series.str.lower before extracting:
df1 = (df['ID'].str.lower()
               .str.extractall('(' + pat + ')')[0]
               .unstack()
               .add_prefix('Word_'))
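
To see what the word boundaries change, here is a small sketch that uses only Python's re module. The sample strings come from the question; the variable names loose and strict are illustrative and not part of the answer.

```python
import re

word_list = ['look', 'be', 'him']

# Plain alternation: matches the words even inside longer words.
loose = '|'.join(word_list)
# \b on both sides of every alternative: only whole words match.
strict = '|'.join(r'\b{}\b'.format(w) for w in word_list)

for text in ['Looking for help', 'Look at him but be nice', 'Himself']:
    print(text,
          re.findall(loose, text, flags=re.I),
          re.findall(strict, text, flags=re.I))
```

With the strict pattern, 'Looking for help' and 'Himself' yield no matches, while 'Look at him but be nice' still yields all three words.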
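Your str.findall solution should be changed in the same way: convert the resulting values to lists, build a DataFrame from them, and join it to the original: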
pat = '|'.join(r"\b{}\b".format(x) for x in word_list)
df1 = (pd.DataFrame(df['ID']
.str.findall(pat, flags = re.I).values.tolist())
.add_prefix('Word_')
.fillna(''))
Or use a list comprehension (should be the fastest):
df1 = (pd.DataFrame([re.findall(pat, x, flags = re.I) for x in df['ID']])
.add_prefix('Word_')
.fillna(''))
For lowercase output, add .lower():
pat = '|'.join(r"\b{}\b".format(x) for x in word_list)
df1 = (pd.DataFrame([re.findall(pat, x.lower(), flags = re.I) for x in df['ID']])
.add_prefix('Word_')
.fillna(''))
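
Putting the list-comprehension variant together end to end, as a sketch rather than a definitive version: the frame is rebuilt from the question's data, re.escape is an added precaution, and passing index=df.index is an assumption that only matters when df does not carry the default 0..n-1 index.

```python
import re

import pandas as pd

df = pd.DataFrame({
    'Comments': [10, 11, 12, 13, 14, 15],
    'ID': ['Looking for help', 'Look at him but be nice', 'Be calm',
           'Being good', 'Him and Her', 'Himself'],
})
word_list = ['look', 'be', 'him']

# Whole-word alternation; re.escape protects against regex metacharacters.
pat = '|'.join(r'\b{}\b'.format(re.escape(w)) for w in word_list)

# One list of matches per row; lowercasing the text gives lowercase output.
matches = [re.findall(pat, text.lower()) for text in df['ID']]

# Ragged lists are padded with missing values, which fillna('') blanks out.
# index=df.index keeps the rows aligned with df for the join below.
df1 = (pd.DataFrame(matches, index=df.index)
         .add_prefix('Word_')
         .fillna(''))

out = df.join(df1)
print(out)
```

Note that the columns hold matches in the order they occur in each text, so the second row reads look, him, be rather than following word_list order.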