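I am working with a data frame that contains many abbreviations in a text column. Using a predefined dictionary, I replace the abbreviations with the full words, and that part works.
However, some abbreviations seem to be replaced more than once. If the full text that replaces an abbreviation itself contains another abbreviation, that one gets replaced as well: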
import pandas as pd

d = {' h ' : ' height ', ' mm ' : ' milimeter ', ' w ' : 'width', ' iaw ' : ' in accordance with ', ' in ' : ' input '}
dt = {"Number":[1, 2], "text": ["measure depth 22 mm h 24 mm w 75 mm", "wheel 4 iaw amm"]}
dataframe = pd.DataFrame(dt)

def process_data(file_name):
    data = file_name
    # replace every dictionary key (treated as a regex) with its value
    data["text"] = data["text"].replace(d, regex=True)
    return data

df = process_data(dataframe)
print(df)
The result is:
   Number                                                    text
0       1  measure depth 22 milimeter height 24 milimeter w 75 mm
1       2                       wheel 4 input accordance with amm
It should be:
   Number                                                    text
0       1  measure depth 22 milimeter height 24 milimeter w 75 mm
1       2                          wheel 4 in accordance with amm
Does anyone know how to solve this?
Answer 0 (score: 1)
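You can use Series.str.replace with a regex. Joining all keys into a single pattern means the text is scanned only once, so text produced by a replacement is never matched again: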
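# whitespace removed from the keys and values; \b word boundaries take its place
d = {'h' : 'height',
     'mm' : 'milimeter',
     'w' : 'width',
     'iaw' : 'in accordance with',
     'in' : 'input'}

# one alternation pattern covering every key
pat = '|'.join(r"\b{}\b".format(x) for x in d.keys())
# the callable looks up the replacement for whichever key matched
dataframe['keyword'] = dataframe['text'].str.replace(pat, lambda x: d[x.group()], regex=True)
print(dataframe)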
   Number                                 text  \
0       1  measure depth 22 mm h 24 mm w 75 mm
1       2                      wheel 4 iaw amm

                                             keyword
0  measure depth 22 milimeter height 24 milimeter...
1                     wheel 4 in accordance with amm
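To see why the single pattern avoids the double replacement, here is a minimal sketch using only the standard `re` module on plain strings. The helper names `expand_chained` and `expand_single_pass` are hypothetical and not part of pandas. The sketch also escapes the keys with `re.escape` and tries longer keys first, which is an extra precaution assumed here rather than something the data above requires.

```python
import re

# Same abbreviation dictionary as in the answer (keys without surrounding spaces).
d = {'h': 'height', 'mm': 'milimeter', 'w': 'width',
     'iaw': 'in accordance with', 'in': 'input'}

def expand_chained(text, mapping):
    """Apply one substitution per key, in turn (hypothetical helper).

    Each pass rescans the output of the previous pass, so the text
    inserted for 'iaw' is still visible when the 'in' pass runs.
    """
    for abbr, full in mapping.items():
        text = re.sub(r"\b{}\b".format(re.escape(abbr)), full, text)
    return text

def expand_single_pass(text, mapping):
    """Apply all substitutions in one scan of the text (hypothetical helper)."""
    # Longest keys first, so a longer key is tried before a shorter one.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = r"\b(?:{})\b".format("|".join(re.escape(k) for k in keys))
    # The callable receives the match object and returns the replacement text.
    return re.sub(pattern, lambda m: mapping[m.group()], text)

print(expand_chained("wheel 4 iaw amm", d))      # wheel 4 input accordance with amm
print(expand_single_pass("wheel 4 iaw amm", d))  # wheel 4 in accordance with amm
```

The same function can be applied row by row with `dataframe['text'].apply(lambda s: expand_single_pass(s, d))` if a plain-Python helper is preferred over `Series.str.replace`.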
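Another solution is to split each value on whitespace, map every token through the dictionary with get, and join the tokens back together with a space: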
# look up each whitespace-separated token; tokens not in d are kept unchanged
f = lambda x: ' '.join(d.get(y, y) for y in x.split())
dataframe['keyword'] = dataframe['text'].apply(f)
print(dataframe)
   Number                                 text  \
0       1  measure depth 22 mm h 24 mm w 75 mm
1       2                      wheel 4 iaw amm

                                             keyword
0  measure depth 22 milimeter height 24 milimeter...
1                     wheel 4 in accordance with amm
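The token-based approach has two side effects worth knowing about, shown in the sketch below on plain strings. The function name `expand_tokens` is hypothetical, and the two input strings are made-up examples rather than data from the question. Because `str.split()` without arguments splits on any run of whitespace, repeated spaces are collapsed, and a token with punctuation attached (such as `mm,`) is not found in the dictionary.

```python
# Same abbreviation dictionary as in the answer (keys without surrounding spaces).
d = {'h': 'height', 'mm': 'milimeter', 'w': 'width',
     'iaw': 'in accordance with', 'in': 'input'}

def expand_tokens(text, mapping):
    """Replace whole whitespace-separated tokens (hypothetical helper)."""
    # mapping.get(token, token) falls back to the token itself when it is not a key.
    return " ".join(mapping.get(token, token) for token in text.split())

# The double space before 'iaw' is collapsed to a single space.
print(expand_tokens("wheel 4  iaw amm", d))  # wheel 4 in accordance with amm

# 'mm,' is not a dictionary key, so it is left as it is; 'h' is still expanded.
print(expand_tokens("22 mm, h 24", d))       # 22 mm, height 24
```

If the text can contain punctuation next to abbreviations, the regex version with `\b` is likely the safer choice of the two.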