I'm trying to replace every name from a list wherever it appears in a data frame (column C).
Names list (small sample; the full list is too large):
Jack
Liam
John
Ethan
George
...
Small data frame sample:
A        B          C
French   house      Phone <phone_numbers>
English  house      email <adresse_mail>
French   apartment  my name is Liam
French   house      Hello George
English  apartment  Ethan, my phone is <phone_numbers>
My script:
import re
import pandas as pd
from pandas import Series
df = pd.read_excel('data_frame.xlsx')
data = Series.to_string(df['C'])
first_names = open('names_list.txt', 'r')
names_read = first_names.readlines()
def names(data):
    names_regex = re.compile(r'\b%s\b' % r'\b|\b'.join(map(re.escape, names_read)))
    replace_names = names_regex.sub('<name>', data)
    return replace_names
no_names = names(data)
print(no_names)
As output, I get the whole data frame back without any modification...
What I expect is the same data, with each name in column C replaced by <name> (for example, "my name is Liam" becoming "my name is <name>").
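Answer 0 (score: 1)
You can build a regular expression from your list of names, match it against column C, and then replace the matched values using apply with a lambda:
import pandas as pd
name_list = ['Jack', 'Liam', 'John', 'Ethan']
mydf = pd.DataFrame({'C': ['Phone <phone_numbers>', 'email <adresse_mail>', 'my name is Liam', 'Hello George', 'Ethan, my phone is <phone_numbers>']})
# Extract every listed name found in C; level_0 is the original row index
match = mydf.C.str.extractall('(' + '|'.join(name_list) + ')').reset_index().set_index('level_0').rename(columns={0: 'name'})
# Align the matches with the original rows (this assumes at most one name per row)
mydf = pd.concat([mydf, match], axis=1)
# The 'match' column comes from extractall's index and is NaN where no name was found
condition = mydf['match'].notnull()
mydf.loc[condition, 'C'] = mydf[condition].apply(lambda x: x['C'].replace(x['name'], '<name>'), axis=1)
Output: in column C, "my name is Liam" becomes "my name is <name>" and "Ethan, my phone is <phone_numbers>" becomes "<name>, my phone is <phone_numbers>". "Hello George" is left unchanged because George is not in name_list.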
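The same regex idea can also be sketched as a single vectorized replacement, without the intermediate match and name columns. This is only an illustration, not the answerer's code: it assumes George is added to name_list (as in the question's list), escapes each name with re.escape, and wraps the alternation in word boundaries so that a name embedded in a longer word is not touched.

```python
import re
import pandas as pd

# George is included here as an assumption, to mirror the question's list
name_list = ['Jack', 'Liam', 'John', 'Ethan', 'George']
mydf = pd.DataFrame({'C': [
    'Phone <phone_numbers>',
    'email <adresse_mail>',
    'my name is Liam',
    'Hello George',
    'Ethan, my phone is <phone_numbers>',
]})

# One alternation of escaped names, bounded by \b on both sides
pattern = r'\b(?:' + '|'.join(map(re.escape, name_list)) + r')\b'

# regex=True is passed explicitly because the default differs across pandas versions
mydf['C'] = mydf['C'].str.replace(pattern, '<name>', regex=True)
print(mydf)
```

Unlike the concat-based version, this handles rows that contain more than one listed name, since every occurrence is substituted in place.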
Answer 1 (score: 1)
You can simply loop over the names and replace each one in the given column:
import pandas as pd
l = [
    ['French','house','Phone <phone_numbers>'],
    ['English','house','email <adresse_mail>'],
    ['French','apartment','my name is Liam'],
    ['French','house','Hello George'],
    ['English','apartment','Ethan, my phone is <phone_numbers>']
]
names = [
    'Jack',
    'Liam',
    'John',
    'Ethan',
    'George'
]
df = pd.DataFrame(l, columns = list('ABC'))
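for i in names:
    # Plain substring replacement; regex=False makes this explicit across pandas versions
    df.C = df.C.str.replace(i, '<name>', regex=False)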
print(df)
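Neither answer says why the question's script leaves names in place. One plausible contributor is that readlines() keeps the trailing newline on each entry, so every alternative in the compiled regex only matches a name that is immediately followed by a line break. The sketch below illustrates this under stated assumptions: io.StringIO stands in for names_list.txt (its contents are assumed), and build_regex is a hypothetical helper, not part of the original script.

```python
import io
import re

# Stand-in for open('names_list.txt', 'r'); the file contents are an assumption
names_file = io.StringIO("Jack\nLiam\nJohn\nEthan\nGeorge\n")

raw_names = names_file.readlines()                      # entries still end with '\n'
clean_names = [n.strip() for n in raw_names if n.strip()]  # newline and blanks removed


def build_regex(names):
    """Hypothetical helper: one word-bounded alternation of escaped names."""
    return re.compile(r'\b(?:' + '|'.join(map(re.escape, names)) + r')\b')


text = 'Ethan, my phone is <phone_numbers>'

# With the raw entries the pattern looks for 'Ethan\n', which is not in the text
unchanged = build_regex(raw_names).sub('<name>', text)

# With stripped entries the name itself is matched
replaced = build_regex(clean_names).sub('<name>', text)

print(repr(unchanged))
print(repr(replaced))
```

If this is the cause, stripping each line before building the pattern (or reading the file with read().split()) should be enough for either answer's approach to work with the names file.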