How do I create a new column using Unicode text from another column in pandas?

Time: 2019-08-27 02:00:44

Tags: python pandas

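I have a pandas DataFrame that contains some Unicode text, and I want to create a new column whose values are dog, cat, or None.

This is my DataFrame:

df = pd.DataFrame({'comment': ['Alice likes 🐶', 'Bob likes 🐕', 'Harry likes dog', 'Don likes cat!', 'this is a tree']})

How can I create a new column like this?

           comment label
0    Alice likes 🐶   dog
1      Bob likes 🐕   dog
2  Harry likes dog   dog
3   Don likes cat!   cat
4   this is a tree  None

Note: I only have a few cat and dog emoji, so I can build the dictionaries by hand.

dict_dog = {'dog': ['dog', "🐶", "🐕"]}
dict_cat = {'cat': ['cat']}

After that, I am stuck on how to proceed.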

4 answers:

Answer 0: (score: 1)

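You can create the dicts like this:

dog = dict.fromkeys(['dog', "🐶", "🐕"], 'dog')

cat = dict.fromkeys(['cat'], 'cat')

Then we apply the same logic as before, using str.findall:

d = {**dog, **cat}
# Take the first keyword found in each comment and map it to its label;
# rows with no match end up as NaN
df['label'] = df.comment.str.findall('|'.join(d.keys())).str[0].map(d)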
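
A minimal sketch of the same idea with str.extract, in case a keyword ever contains a regex metacharacter. The name keyword_to_label is hypothetical, and the two dog emoji are assumed to be the ones used in the question. Each key is passed through re.escape before the alternation is built, so it is matched literally.

```python
import re
import pandas as pd

df = pd.DataFrame({'comment': ['Alice likes 🐶', 'Bob likes 🐕', 'Harry likes dog',
                               'Don likes cat!', 'this is a tree']})

# Hypothetical lookup table: every keyword or emoji maps to its label.
keyword_to_label = {'dog': 'dog', '🐶': 'dog', '🐕': 'dog', 'cat': 'cat'}

# Escape each key so characters such as '?' or '+' are matched literally,
# then wrap the alternation in one capture group for str.extract.
pattern = '(' + '|'.join(re.escape(k) for k in keyword_to_label) + ')'

# expand=False returns a Series with the first match per row (NaN if none).
first_match = df['comment'].str.extract(pattern, expand=False)
df['label'] = first_match.map(keyword_to_label)

print(df)
```

Rows without any keyword keep NaN as their label; chain .fillna('None') if a string placeholder is preferred.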

Answer 1: (score: 1)

Try:

df['label'] = np.where(df['comment'].str.contains('dog|🐶|🐕'), 'dog', 'cat')

Note that this labels every row without a dog keyword as cat, including 'this is a tree'.

If there are more than two animals, you can nest np.where:

df['label'] = np.where(df['comment'].str.contains('dog|🐶|🐕'), 'dog',
                       np.where(df['comment'].str.contains('cat'), 'cat', 'None'))
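
Nested np.where calls get hard to read as the number of animals grows. One possible alternative, sketched here rather than taken from the answer, is np.select, which takes a list of conditions and a parallel list of choices. The names conditions and choices are illustrative, and the emoji are assumed to be the ones from the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'comment': ['Alice likes 🐶', 'Bob likes 🐕', 'Harry likes dog',
                               'Don likes cat!', 'this is a tree']})

comment = df['comment']

# One boolean mask per label, listed in priority order.
conditions = [
    comment.str.contains('dog|🐶|🐕'),
    comment.str.contains('cat'),
]
choices = ['dog', 'cat']

# The first condition that is True wins; rows matching none get the default.
# The default here is the string 'None', as in the nested np.where version.
df['label'] = np.select(conditions, choices, default='None')

print(df)
```

Adding another animal then only means appending one mask and one label.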

Answer 2: (score: 1)

Here is another approach, using regex and apply():

import re

decoder = {'dog': ['dog', "🐶", "🐕"], 'cat': ['cat']}

def check(c):
    # Split the comment into words and strip punctuation such as '!'
    c = list(map(lambda l: re.sub('[!@#$]', '', l), c.split(' ')))
    res_dog = [i for i in c if i in decoder['dog']]
    res_cat = [i for i in c if i in decoder['cat']]
    return 'dog' if res_dog else 'cat' if res_cat else None

# Apply function
df['label'] = df['comment'].apply(check)

Result:

           comment label
0    Alice likes 🐶   dog
1      Bob likes 🐕   dog
2  Harry likes dog   dog
3   Don likes cat!   cat
4   this is a tree  None

Answer 3: (score: 1)

This works for both uppercase and lowercase letters. Assigning through df.loc with a boolean mask is a standard way to fill a column in pandas. If a simpler approach works, try it before reaching for a more complicated one.

import pandas as pd

df = pd.DataFrame({'comment': ['Alice likes 🐶', 'Bob likes 🐕', 'Harry likes dog', 'Don likes cat!', 'this is a tree']})


df['comment'] = df['comment'].astype(str)
df['label'] = 'None'  # default label, stored as the string 'None'

# Lowercase the text first so that 'Dog' and 'CAT' also match
df.loc[df.comment.str.lower().str.contains("dog"),'label'] = 'dog'
df.loc[df.comment.str.lower().str.contains("cat"),'label'] = 'cat'

df.loc[df.comment.str.contains("🐶"),'label'] = 'dog'
df.loc[df.comment.str.contains("🐕"),'label'] = 'dog'

print(df)

           comment label
0    Alice likes 🐶   dog
1      Bob likes 🐕   dog
2  Harry likes dog   dog
3   Don likes cat!   cat
4   this is a tree  None
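
When the keyword list grows, the repeated df.loc lines can be folded into a loop. The following is only a sketch: the keywords dict is a hypothetical name, and the emoji are assumed to be the ones from the question. Passing case=False replaces the explicit .str.lower() call, and regex=False makes every keyword a literal substring, so a keyword such as '?' would not be read as a regex quantifier.

```python
import pandas as pd

df = pd.DataFrame({'comment': ['Alice likes 🐶', 'Bob likes 🐕', 'Harry likes dog',
                               'Don likes cat!', 'this is a tree']})

# Hypothetical mapping from label to its keywords. If a comment matches
# several labels, the label processed last overwrites the earlier ones.
keywords = {'cat': ['cat'], 'dog': ['dog', '🐶', '🐕']}

df['label'] = None  # rows that match nothing keep None
for label, words in keywords.items():
    for word in words:
        # Literal, case-insensitive substring test for this keyword.
        mask = df['comment'].str.contains(word, case=False, regex=False)
        df.loc[mask, 'label'] = label

print(df)
```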