Python: Replacing whole-word dictionary keys in a pandas df with the dictionary values

Asked: 2017-07-22 16:10:50

Tags: python regex pandas dictionary

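Question: I need to match whole words in the pandas df column 'messages' against dictionary keys and replace them with the dictionary values. Is there a way to do this with the df["column"].replace command? Or do I need to find another way to replace whole words?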

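Background: In my pandas dataframe I have a column of text messages that contain English first names (the dictionary keys), which I am trying to replace with the dictionary value "First Name". The column in question looks like this, where you can see "tommy" as a single name.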

 tester.df["message"]
          message  
    0                               what do i need to do   
    1                               what do i need to do   
    2  hi tommy thank you for contacting app ...   
    3  hi tommy thank you for contacting  app ...   
    4  hi we are just following up to see if you read... 

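The dictionary is built from a list I extracted from the 2000 census database. It contains many different first names that can match text inline, including 'al' or 'tom', so if I'm not careful my value "First Name" can end up anywhere in the pandas df column of messages: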

import re
import requests

# download the full list of names
r = requests.get('http://deron.meranda.us/data/census-derived-all-first.txt')

# US Census first names: the first whitespace-delimited token after each newline
list1 = re.findall(r'\n(.*?)\s', r.text, re.DOTALL)

# turn list to string, force lower case
str1 = ','.join(list1)
str1 = str1.lower()

# split back into a list of names
# (iterating over the string itself would yield single characters)
names = [x.strip() for x in str1.split(',')]

# turn into dictionary with "FirstName" as value
str1 = dict((el, 'FirstName') for el in names)

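Now I would like to replace whole words in the DF column "message" that match the dictionary keys with the 'FirstName' value. Unfortunately, when I do the following, it replaces text inside the messages wherever even short names like "al" or "tom" match.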

In [254]: tester["message"].replace(str1, regex = True)
Out[254]: 
0                   wFirstNamet do i neFirstName to do
1                   wFirstNamet do i neFirstName to do
2    hi FirstNameFirstName tFirstName you for conFi...
3    hi FirstNameFirstName tFirstName you for conFi...
4    hi we are just followFirstNameg up to FirstNam...
Name: message, dtype: object
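
The effect can be reproduced without pandas. The sketch below is only an illustration: it uses the standard re module and a hypothetical three-name dictionary (not the census list) to show that a bare name used as a regex matches anywhere in the text, including inside longer words.

```python
import re

# Hypothetical miniature of the name dictionary (the real one comes from the census list).
names = {"al": "FirstName", "tom": "FirstName", "ed": "FirstName"}

def replace_unanchored(text):
    # Each key is used as a bare regex, so it matches anywhere, even inside a word.
    for pattern, value in names.items():
        text = re.sub(pattern, value, text)
    return text

print(replace_unanchored("what do i need to do"))   # "ed" inside "need" is replaced
print(replace_unanchored("hi tommy thank you"))     # "tom" inside "tommy" is replaced
```

Nothing in the pattern says where a name has to start or stop, which is why short names are replaced inside unrelated words.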

Any help with matching and replacing only whole keys is appreciated!

Update / attempted fix 1: tried adding some regex functionality to match only whole words

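I tried adding a word-break character to each word in the extracted string from which the dictionary is built. Unfortunately, the single backslashes that delimit the words turn into double backslashes, and the keys no longer work for the dictionary key -> value replacement.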

#import the total name 
r = requests.get('http://deron.meranda.us/data/census-derived-all-first.txt')
l = requests.get('https://deron.meranda.us/data/popular-last.txt')
#US Census first names
list1= re.findall(r'\n(.*?)\s', r.text, re.DOTALL)

#add regex before

string = 'r"\\'
endstring = '\\b'

list1 = [ string + x + endstring  for x in list1]

#turn list to string, force lower case
str1 = ', '.join('"{0}"'.format(w) for w in list1)

str1 = ','.join(list1)
str1 = (str1.lower())


##if we do print(str1) it shows one backslash 
##turn to list ..but print() doesn't let us have one backlash anymore 

str1 = [x.strip() for x in str1.split(',')]



#turn to dictionary with "firstname"
str1 = dict((el, 'FirstName') for el in str1)

Then, when I try to match and replace using the updated dictionary keys with the word-break regex, I get a bad escape error:

tester["message"].replace(str1, regex = True)

"Traceback (most recent call last):     error: bad escape \j"

This may be the right direction, but the conversion between single and double backslashes seems tricky...
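
A likely cause of the error, sketched with the standard re module: the r prefix is part of Python's literal syntax and cannot be concatenated into a string. 'r"\\' produces the three characters r, " and \, so a name such as james becomes the pattern r"\james\b, and the regex engine rejects \j as an unknown escape. One hedged alternative is to put the raw-string literal r'\b' on both sides of re.escape(name). The two-name list below is a hypothetical stand-in for the census list.

```python
import re

names = ["james", "tom"]  # hypothetical sample, standing in for the census list

# What the attempt above builds: the characters r, " and \ are glued onto the name.
broken = ['r"\\' + name + '\\b' for name in names]
print(broken[0])  # r"\james\b

try:
    re.compile(broken[0])
    broken_error = None
except re.error as exc:
    broken_error = str(exc)  # the engine rejects \j as an unknown escape
print(broken_error)

# Whole-word patterns: r"\b" is a two-character string (backslash + b).
patterns = {r"\b" + re.escape(name) + r"\b": "FirstName" for name in names}

text = "hi tom thank you for contacting app"
for pattern, value in patterns.items():
    text = re.sub(pattern, value, text)
print(text)
```

A dictionary built this way can be passed to tester["message"].replace(patterns, regex=True), in the same form as the call above.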

1 Answer:

Answer 0: (score: 1)

First, you need to prepare the list of names so that each name is matched only when it is preceded by either the beginning of the string (^) or whitespace (\s), and followed by either whitespace or the end of the string ($). Then you need to make sure that the preceding and following elements are preserved (via backreferences). Assuming you have a list first_names that contains all the first names that should be replaced:

replacement_dict = {
    r'(^|\s){}($|\s)'.format(name): r'\1FirstName\2'
    for name in first_names
}

Let's take a look at the regex:

(         # Start group.
  ^|\s    # Match either beginning of string or whitespace.
)         # Close group.
{}        # This is where the actual name will be inserted.
(
  $|\s    # Match either end of string or whitespace.
)

The replacement regex:

\1     # Backreference; whatever was matched by the first group.
FirstName
\2     # Backreference; whatever was matched by the second group.
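
To see the replacement_dict construction at work, here is a minimal sketch on plain strings using the standard re module. The three-name first_names list and the replace_names helper are hypothetical, and re.escape is an added precaution (not part of the answer) in case a name contains a regex metacharacter.

```python
import re

first_names = ["tommy", "tom", "al"]  # hypothetical sample list

# Same construction as in the answer, with re.escape added as a precaution.
replacement_dict = {
    r"(^|\s){}($|\s)".format(re.escape(name)): r"\1FirstName\2"
    for name in first_names
}

def replace_names(message):
    # Apply every pattern in turn; the backreferences put the
    # surrounding whitespace back around the replacement.
    for pattern, replacement in replacement_dict.items():
        message = re.sub(pattern, replacement, message)
    return message

messages = [
    "what do i need to do",
    "hi tommy thank you for contacting app",
    "tom",
]
for message in messages:
    print(replace_names(message))
```

With pandas, the same dictionary can be passed to tester["message"].replace(replacement_dict, regex=True), the call form used in the question. One caveat: each match consumes the whitespace around it, so two listed names separated by a single space are not both replaced in one pass. Lookarounds such as (?<!\S) and (?!\S) avoid that.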