Replacing strings within strings in a Pandas column using a dictionary

Time: 2017-09-21 11:13:53

Tags: python pandas dictionary dataframe replace

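I am trying to use a dictionary to replace strings in a pandas column: wherever a key appears, it should be replaced with the corresponding value. However, each cell in the column contains a sentence. So I first have to tokenize the sentence, check whether any word in it matches a key in my dictionary, and then replace that word with the corresponding value.

However, the result I keep getting is None. Is there a better, more Pythonic way to solve this?

Here is my minimal example so far. In the comments, I have marked where the problem occurs.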

import pandas as pd

data = {'Categories': ['animal', 'plant', 'object'],
        'Type': ['tree', 'dog', 'rock'],
        'Comment': ['The NYC tree is very big',
                    'The cat from the UK is small',
                    'The rock was found in LA.']}

ids = {'Id': ['NYC', 'LA', 'UK'],
       'City': ['New York City', 'Los Angeles', 'United Kingdom']}


df = pd.DataFrame(data)
ids = pd.DataFrame(ids)

def col2dict(ids):
    data = ids[['Id', 'City']]
    idDict = data.set_index('Id').to_dict()['City']
    return idDict

def replaceIds(data,idDict):
    ids = idDict.keys()
    types = idDict.values()
    data['commentTest'] = data['Comment']
    words = data['commentTest'].apply(lambda x: x.split())
    for (i,word) in enumerate(words):
        #Here we can see that the words appear
        print word
        print ids
        if word in ids:
        #Here we can see that they are not being recognized. What happened?
            print ids
            print word
            words[i] = idDict[word]
            data['commentTest'] = ' '.apply(lambda x: ''.join(x))
    return data

idDict = col2dict(ids)
results = replaceIds(df, idDict)

Result:

None

I am using Python 2.7, and when I print the dict, the strings are shown with the u' Unicode prefix.

My expected result is:

  Categories                       Comment  Type                               commentTest
0     animal      The NYC tree is very big  tree        The New York City tree is very big
1      plant  The cat from the UK is small   dog  The cat from the United Kingdom is small
2     object     The rock was found in LA.  rock        The rock was found in Los Angeles.

2 answers:

Answer 0: (score: 5)

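You can create a dictionary and then use replace with regex=True, so that the keys are matched inside the sentences:

# ids is the DataFrame with the 'Id' and 'City' columns from the question
idDict = dict(zip(ids['Id'], ids['City']))
df['commentTest'] = df['Comment'].replace(idDict, regex=True)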

Answer 1: (score: 0)

Using str.replace() is actually much faster than replace(), even though str.replace() requires a loop:

ids = {'NYC': 'New York City', 'LA': 'Los Angeles', 'UK': 'United Kingdom'}

for old, new in ids.items():
    df['Comment'] = df['Comment'].str.replace(old, new, regex=False)

#   Categories  Type                                   Comment
# 0     animal  tree        The New York City tree is very big
# 1      plant   dog  The cat from the United Kingdom is small
# 2     object  rock        The rock was found in Los Angeles.

The only case where replace() outperforms the str.replace() loop is with small dataframes:

[Figure: timings for str.replace vs replace]

Timing functions, for reference:

def Series_replace(df):
    df['Comment'] = df['Comment'].replace(ids, regex=True)
    return df

def Series_str_replace(df):
    for old, new in ids.items():
        df['Comment'] = df['Comment'].str.replace(old, new, regex=False)
    return df
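
A rough way to reproduce this kind of comparison is sketched below. It is only an illustration: the helper `make_df`, the repetition factor of 10,000, and the copy-based variants `series_replace` / `series_str_replace` are assumptions made here so the snippet is self-contained, not the setup behind the original timings.

```python
import timeit

import pandas as pd

ids = {'NYC': 'New York City', 'LA': 'Los Angeles', 'UK': 'United Kingdom'}


def series_replace(df):
    # Dictionary-based replace; with regex=True the keys are treated as patterns
    out = df.copy()
    out['Comment'] = out['Comment'].replace(ids, regex=True)
    return out


def series_str_replace(df):
    # One literal str.replace pass per dictionary entry
    out = df.copy()
    for old, new in ids.items():
        out['Comment'] = out['Comment'].str.replace(old, new, regex=False)
    return out


def make_df(n):
    # Hypothetical helper: repeat the three sample comments n times
    comments = ['The NYC tree is very big',
                'The cat from the UK is small',
                'The rock was found in LA.']
    return pd.DataFrame({'Comment': comments * n})


big = make_df(10000)
t_replace = timeit.timeit(lambda: series_replace(big), number=3)
t_str_replace = timeit.timeit(lambda: series_str_replace(big), number=3)
print('Series.replace:   %.3f s' % t_replace)
print('str.replace loop: %.3f s' % t_str_replace)
```

Both variants work on a copy so that repeated timing runs always start from the unmodified comments. Absolute numbers will depend on the pandas version and the machine.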

Note that if ids is a dataframe rather than a dictionary, you can get the same performance with itertuples():

ids = pd.DataFrame({'Id': ['NYC', 'LA', 'UK'], 'City': ['New York City', 'Los Angeles', 'United Kingdom']})

for row in ids.itertuples():
    df['Comment'] = df['Comment'].str.replace(row.Id, row.City, regex=False)
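
One caveat with the loops above is that they replace any occurrence of a key, including inside longer words (for example, `LA` inside `PLAN`). Since the question asks about matching whole words, a possible variation is sketched below: it compiles a single pattern with word boundaries and looks each match up in the dictionary. The extra sample row and the name `pattern` are illustrative assumptions, and passing a callable to `str.replace` assumes a reasonably recent pandas.

```python
import re

import pandas as pd

ids = {'NYC': 'New York City', 'LA': 'Los Angeles', 'UK': 'United Kingdom'}

df = pd.DataFrame({'Comment': ['The NYC tree is very big',
                               'The cat from the UK is small',
                               'The rock was found in LA.',
                               'The PLAN was made in LA']})

# Longest keys first, so a longer key wins over a shorter one sharing a prefix
keys = sorted(ids, key=len, reverse=True)

# \b restricts matches to whole words; re.escape guards against special characters
pattern = re.compile(r'\b(?:' + '|'.join(re.escape(k) for k in keys) + r')\b')

# The callable receives each match object and returns its replacement
df['commentTest'] = df['Comment'].str.replace(
    pattern, lambda m: ids[m.group(0)], regex=True)

print(df['commentTest'].tolist())
```

This makes a single pass over the column regardless of how many keys the dictionary holds, at the cost of using the regex engine rather than literal matching.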