I have two datasets, df1 and df:
df1
df1 = pd.DataFrame({'ids': [101,102,103],'vals': ['apple','java','python']})
ids vals
0 101 apple
1 102 java
2 103 python
df
df = pd.DataFrame({'TEXT_DATA': [u'apple a day keeps doctor away', u'apple tree in my farm', u'python is not new language', u'Learn python programming', u'java is second language']})
TEXT_DATA
0 apple a day keeps doctor away
1 apple tree in my farm
2 python is not new language
3 Learn python programming
4 java is second language
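I want to find which keyword from df1['vals'] appears in each row of TEXT_DATA and map the matching id into a new column, so that my output is: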
TEXT_DATA NEW_COLUMN
0 apple a day keeps doctor away 101
1 apple tree in my farm 101
2 python is not new language 103
3 Learn python programming 103
4 java is second language 102
I tried matching with:
df[df['TEXT_DATA'].str.contains("apple")]
Is there a way to do this?
Answer 0 (score: 1)
You can do the following:
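# Map each keyword to its id (same pairs as in df1)
my_words = {'python': 103, 'apple': 101, 'java': 102}
for word in my_words.keys():
    # Rows whose text contains the keyword get that keyword's id
    df.loc[df['TEXT_DATA'].str.contains(word, na=False), 'NEW_COLUMN'] = my_words[word]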
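A self-contained sketch of the same loop, using the question's column names and building the mapping from df1 instead of hard-coding it. The nullable `Int64` initialisation is my own addition (not part of the answer) so the ids stay integers; `regex=False` treats each keyword as a literal substring.

```python
import pandas as pd

df1 = pd.DataFrame({'ids': [101, 102, 103],
                    'vals': ['apple', 'java', 'python']})
df = pd.DataFrame({'TEXT_DATA': ['apple a day keeps doctor away',
                                 'apple tree in my farm',
                                 'python is not new language',
                                 'Learn python programming',
                                 'java is second language']})

# Build {'apple': 101, 'java': 102, 'python': 103} from df1
my_words = dict(zip(df1['vals'], df1['ids']))

# Assumption: start with an all-missing nullable integer column,
# so unmatched rows stay <NA> and matched ids are not cast to float
df['NEW_COLUMN'] = pd.array([pd.NA] * len(df), dtype='Int64')

for word, word_id in my_words.items():
    # Literal substring match; missing text counts as "no match"
    mask = df['TEXT_DATA'].str.contains(word, regex=False, na=False)
    df.loc[mask, 'NEW_COLUMN'] = word_id

print(df)
```

If a row contains more than one keyword, the keyword processed last in the loop overwrites the earlier ones.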
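Answer 1 (score: 1)
First, build a regex from the values in df1['vals'] and extract the matching keyword from each row of TEXT_DATA into a new column. Then merge the two dataframes on that column.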
extr = '|'.join(x for x in df1['vals'])
df['vals'] = df['TEXT_DATA'].str.extract('('+ extr + ')', expand=False)
newdf = pd.merge(df, df1, on='vals', how='left')
To keep only the columns you need in the result, select them by name:
newdf[['TEXT_DATA','ids']]
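A slightly more defensive variant of the extract-and-merge approach, as a sketch. The `re.escape` call and the `\b` word boundaries are my additions, on the assumption that keywords may contain regex metacharacters or appear inside longer words (for example `java` inside `javascript`). The rename to `NEW_COLUMN` just matches the output asked for in the question.

```python
import re
import pandas as pd

df1 = pd.DataFrame({'ids': [101, 102, 103],
                    'vals': ['apple', 'java', 'python']})
df = pd.DataFrame({'TEXT_DATA': ['apple a day keeps doctor away',
                                 'apple tree in my farm',
                                 'python is not new language',
                                 'Learn python programming',
                                 'java is second language']})

# Escape each keyword so it is matched literally, and require
# word boundaries so only whole words count as a match
pattern = r'\b(' + '|'.join(re.escape(v) for v in df1['vals']) + r')\b'

# First matching keyword per row; NaN when no keyword is found
df['vals'] = df['TEXT_DATA'].str.extract(pattern, expand=False)

# A left merge keeps every text row, including unmatched ones
newdf = df.merge(df1, on='vals', how='left')
result = newdf[['TEXT_DATA', 'ids']].rename(columns={'ids': 'NEW_COLUMN'})

print(result)
```

Note that `str.extract` returns only the first match in each row, so a text mentioning two keywords is mapped to whichever appears first.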
Answer 2 (score: 0)
You can take the cartesian product of the two dataframes, then select the relevant rows and columns.
tmp = df.assign(key=1).merge(df1.assign(key=1), on='key').drop(columns='key')
resul = tmp.loc[tmp.apply(func=(lambda x: x.vals in x.TEXT_DATA), axis=1)]\
.drop(columns='vals').reset_index(drop=True)
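For reference, a sketch of the same cartesian-product idea using `merge(how='cross')`, which removes the need for the temporary `key` column. This assumes pandas 1.2 or later, where `how='cross'` is available. The list comprehension replaces the row-wise `apply`, and the rename to `NEW_COLUMN` matches the question's desired output.

```python
import pandas as pd

df1 = pd.DataFrame({'ids': [101, 102, 103],
                    'vals': ['apple', 'java', 'python']})
df = pd.DataFrame({'TEXT_DATA': ['apple a day keeps doctor away',
                                 'apple tree in my farm',
                                 'python is not new language',
                                 'Learn python programming',
                                 'java is second language']})

# Every text row paired with every keyword row (5 x 3 = 15 rows)
tmp = df.merge(df1, how='cross')

# Keep only the pairs where the keyword occurs in the text
mask = [val in text for val, text in zip(tmp['vals'], tmp['TEXT_DATA'])]

result = (tmp.loc[mask, ['TEXT_DATA', 'ids']]
             .rename(columns={'ids': 'NEW_COLUMN'})
             .reset_index(drop=True))

print(result)
```

Two caveats: the intermediate frame has len(df) * len(df1) rows, so this can get expensive for large inputs, and rows that match no keyword are dropped from the result rather than kept with a missing id.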