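My goal is to read the lines of a file and replace every special character, such as the French accented characters (à, é, ç, ...), with its plain equivalent (a, e, c, ...).
I'm using Python 3. The example in the gensim documentation works with a simple literal (for example, deaccent("àéç")), but not with the lines I read from a file. Currently my code only gives me "àéç" instead of "aec":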
from gensim.utils import deaccent

def getTextFromFile(filename):
    with open(filename) as file:
        text = [line.rstrip() for line in file.readlines()]
        file.close()
    for line in text:
        print(deaccent(line))
    return text
My file contains: àéç
I want to get: aec
Answer 0 (score: 0)
As far as I can tell, it works fine:
Python 3.7.0 (default, Aug 22 2018, 20:50:05)
Type "copyright", "credits" or "license" for more information.
IPython 4.1.2 -- An enhanced Interactive Python.
? -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help -> Python's own help system.
object? -> Details about 'object', use 'object??' for extra details.
In [1]: from gensim.utils import deaccent
In [2]: deaccent('àéç')
Out[2]: 'aec'
In [3]: astr = 'àéç'
In [4]: dstr = deaccent(astr)
In [5]: print(dstr)
aec
If you want the getTextFromFile() method to return the text without accents, don't return the original text; return the result of the deaccent() call instead.
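As a minimal sketch of that change, the function below returns the de-accented lines instead of the raw ones. Two things here are assumptions rather than anything taken from the question: strip_accents is a hypothetical standard-library stand-in for gensim's deaccent (used only so the sketch runs without gensim installed), and encoding="utf-8" is a guess about how the file was saved.

```python
import unicodedata


def strip_accents(text):
    # Hypothetical stand-in for gensim.utils.deaccent, not gensim's own code:
    # decompose each character, drop the combining marks, then recompose.
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", kept)


def getTextFromFile(filename, encoding="utf-8"):
    # encoding="utf-8" is an assumption; pass the file's real encoding if it differs.
    with open(filename, encoding=encoding) as file:
        # Return the de-accented lines rather than the original ones.
        # The with block closes the file, so no explicit close() is needed.
        return [strip_accents(line.rstrip()) for line in file]
```

If gensim is available, swapping strip_accents(...) for deaccent(...) in the list comprehension should work the same way for input like the file in the question.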