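I have a large CSV file containing many abbreviations that I need to replace with their full words. I found a few posts here, e.g. 1, 2, but most of them either change the whole line or require doing it manually.
My CSV file looks like this: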
infoID messages
111 we need to fix the car mag but we can't
113 we need a shf to perform eng change
115 gr is needed to change
116 bat needs change
117 car towed for ext change
118 car ml is high
.
.
I have another file that lists the full word for every abbreviation, and I want to apply it to my document. Its format is:
shf:shaft
gr:gear
ml:mileage
It would be great if you could help with code that I can also run on my end. Thanks.
Answer 0 (score: 4)
Read the text file in as a Series s that looks like this:
0 mag:magnitude
1 shf:shaft
2 gr:gear
3 bat:battery
4 ext:exhaust
5 ml:mileage
Name: 0, dtype: object
Split on the colon, then convert the Series into a dictionary that maps each abbreviation to its replacement:
dict(s.str.split(':').tolist())
# {'bat': 'battery',
# 'ext': 'exhaust',
# 'gr': 'gear',
# 'mag': 'magnitude',
# 'ml': 'mileage',
# 'shf': 'shaft'}
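The one-liner above assumes every line splits into exactly two parts; dict() raises a ValueError if a line is blank or has no colon. A minimal plain-Python sketch of a more tolerant parser follows. The function name parse_mapping and the sample lines are assumptions for illustration, not part of the original answer.

```python
def parse_mapping(lines):
    """Build {abbreviation: full_word} from 'short:full' lines.

    Hypothetical helper: skips blank lines and lines without a colon
    instead of raising, and trims stray whitespace.
    """
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line or ":" not in line:
            continue  # ignore lines that are not 'short:full' pairs
        short, full = line.split(":", 1)  # split on the first colon only
        mapping[short.strip()] = full.strip()
    return mapping


# Sample lines standing in for the contents of the mapping file.
sample = ["shf:shaft", "gr:gear", "", "ml: mileage"]
mapping = parse_mapping(sample)
```

With a real file, something like parse_mapping(open(path)) should work the same way, since a file object iterates over its lines.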
Use this dictionary to perform a replace with regex=True:
df['messages'].replace(dict(s.str.split(':').tolist()), regex=True)
0 we need to fix the car magnitude but we can't
1 we need a shaft to perform eng change
2 gear is needed to change
3 battery needs change
4 car towed for exhaust change
5 car mileage is high
Name: messages, dtype: object
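One thing to watch for: with regex=True the keys are treated as patterns that can match anywhere in the text, including inside longer words. The sketch below illustrates that effect with the standard re module rather than pandas, so it is an analogy, not the pandas implementation; naive_expand and the sample sentence are made up for the demonstration.

```python
import re

# Two entries from the mapping are enough to show the effect.
mapping = {"gr": "gear", "ml": "mileage"}


def naive_expand(text):
    """Replace every occurrence of each key, even inside other words."""
    for short, full in mapping.items():
        text = re.sub(short, full, text)
    return text


result = naive_expand("gr is great")
print(result)  # the 'gr' inside 'great' is replaced as well
```

The sample data in the question happens not to trigger this, but free-text messages in a large file might.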
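Note that if these are strictly whole-word replacements, you can extend this solution by converting the key strings into regular expressions that use word boundaries. Without them, a key such as gr would also match inside a longer word like great. For good measure, escape the strings as well: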
import re
mapping = {fr'\b{re.escape(k)}\b': v for k, v in s.str.split(':').tolist()}
df['messages'].replace(mapping, regex=True)
0 we need to fix the car magnitude but we can't
1 we need a shaft to perform eng change
2 gear is needed to change
3 battery needs change
4 car towed for exhaust change
5 car mileage is high
Name: messages, dtype: object
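The same whole-word idea can also be sketched with a single compiled pattern and a lookup callback, which scans each message once instead of once per key. This is an alternative written with the standard re module, not what the answer above does; the name expand and the test sentence are assumptions.

```python
import re

mapping = {"mag": "magnitude", "shf": "shaft", "gr": "gear",
           "bat": "battery", "ext": "exhaust", "ml": "mileage"}

# Longest keys first, so a longer abbreviation is tried before a shorter
# one that happens to be its prefix.
keys = sorted(mapping, key=len, reverse=True)
pattern = re.compile(r"\b(?:" + "|".join(re.escape(k) for k in keys) + r")\b")


def expand(text):
    """Replace whole-word abbreviations using one pass over the text."""
    return pattern.sub(lambda m: mapping[m.group(0)], text)


print(expand("the grip on the bat is great"))
```

Here grip and great are left alone because gr is not a whole word in either. On a DataFrame this could presumably be applied with df['messages'].map(expand).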
Answer 1 (score: 3)
Another approach, using pd.Series.apply:
# d starts out as the raw text of the mapping file, one short:full pair per line
d = dict(i.split(':') for i in d.strip().split('\n'))
#{'bat': 'battery',
# 'ext': 'exhaust',
# 'gr': 'gear',
# 'mag': 'magnitude',
# 'ml': 'mileage',
# 'shf': 'shaft'}
df['messages'].apply(lambda x: ' '.join(d.get(i, i) for i in x.split()))
Output:
0 we need to fix the car magnitude but we can't
1 we need a shaft to perform eng change
2 gear is needed to change
3 battery needs change
4 car towed for exhaust change
5 car mileage is high
Name: messages, dtype: object
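A caveat with the split-and-lookup approach: it only matches tokens that are exactly equal to a key, so an abbreviation with punctuation attached is left as-is, and runs of whitespace are collapsed to single spaces. A small plain-Python sketch of that behaviour follows; expand_tokens and the sample sentences are made up for illustration.

```python
d = {"gr": "gear", "bat": "battery"}


def expand_tokens(text):
    """Replace whitespace-separated tokens that exactly match a key."""
    return " ".join(d.get(tok, tok) for tok in text.split())


clean = expand_tokens("bat needs change")
punctuated = expand_tokens("check the gr, then the bat.")
spaced = expand_tokens("bat   needs   change")

print(clean)
print(punctuated)  # 'gr,' and 'bat.' are not keys, so nothing changes
print(spaced)      # extra spaces between words are not preserved
```

If the messages contain punctuation, the word-boundary regex from the first answer is probably the safer choice.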