I'm on Windows 10, using Python 2.7.15 | Anaconda. Whenever I run
mymodel = gensim.models.Word2Vec.load(pretrain)
mymodel.min_count = mincount
sentences = gensim.models.word2vec.LineSentence('ontology_corpus.lst')
mymodel.build_vocab(sentences, update=True)  # ERROR HERE ****
I get this error:
Traceback (most recent call last):
  File "runWord2Vec.py", line 23, in <module>
    mymodel.build_vocab(sentences, update=True)
  File "C:xxxx\lib\site-packages\gensim\models\base_any2vec.py", line 936, in build_vocab
    sentences=sentences, corpus_file=corpus_file, progress_per=progress_per, trim_rule=trim_rule)
  File "C:xxxx\lib\site-packages\gensim\models\word2vec.py", line 1591, in scan_vocab
    total_words, corpus_count = self._scan_vocab(sentences, progress_per, trim_rule)
  File "C:xxxxx\lib\site-packages\gensim\models\word2vec.py", line 1560, in _scan_vocab
    for sentence_no, sentence in enumerate(sentences):
  File "C:xxxx\lib\site-packages\gensim\models\word2vec.py", line 1442, in __iter__
    line = utils.to_unicode(line).split()
  File "C:xxxx\lib\site-packages\gensim\utils.py", line 359, in any2unicode
    return unicode(text, encoding, errors=errors)
  File "C:xxxxx\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe6 in position 124: invalid continuation byte
Now, this traces back to this LineSentence class:
class LineSentence(object):
    def __init__(self, source, max_sentence_length=MAX_WORDS_IN_BATCH, limit=None):
        self.source = source
        self.max_sentence_length = max_sentence_length
        self.limit = limit

    def __iter__(self):
        """Iterate through the lines in the source."""
        try:
            # Assume it is a file-like object and try treating it as such
            # Things that don't have seek will trigger an exception
            self.source.seek(0)
            for line in itertools.islice(self.source, self.limit):
                line = utils.to_unicode(line).split()
                i = 0
                while i < len(line):
                    yield line[i: i + self.max_sentence_length]
                    i += self.max_sentence_length
        except AttributeError:
            # If it didn't work like a file, use it as a string filename
            with utils.smart_open(self.source) as fin:
                for line in itertools.islice(fin, self.limit):
                    line = utils.to_unicode(line).split()  # ERROR HERE *************
                    i = 0
                    while i < len(line):
                        yield line[i: i + self.max_sentence_length]
                        i += self.max_sentence_length
As can be seen in the last call of the traceback, I could change the errors argument to errors='ignore', or change the following line:
utils.to_unicode(line).split()
to this:
line.split()
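For illustration, here's how the errors='ignore' variant could be applied without editing gensim's source: a hypothetical drop-in class of my own (TolerantLineSentence below, not part of gensim) that mirrors LineSentence's file branch but decodes leniently:

import itertools

from gensim import utils


class TolerantLineSentence(object):
    """Hypothetical drop-in for LineSentence that skips undecodable bytes."""

    def __init__(self, source, max_sentence_length=10000, limit=None):
        self.source = source  # path to a one-sentence-per-line corpus file
        self.max_sentence_length = max_sentence_length
        self.limit = limit

    def __iter__(self):
        with utils.smart_open(self.source) as fin:
            for line in itertools.islice(fin, self.limit):
                # errors='ignore' silently drops any undecodable bytes
                line = utils.to_unicode(line, errors='ignore').split()
                i = 0
                while i < len(line):
                    yield line[i: i + self.max_sentence_length]
                    i += self.max_sentence_length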
A sample of the ontology_corpus.lst file:
<http://purl.obolibrary.org/obo/GO_0090141> EquivalentTo <http://purl.obolibrary.org/obo/GO_0065007> and <http://purl.obolibrary.org/obo/RO_0002213> some <http://purl.obolibrary.org/obo/GO_0000266>
<http://purl.obolibrary.org/obo/GO_0090141> SubClassOf <http://purl.obolibrary.org/obo/GO_0065007>
The problem is that this runs, but since the encoding errors are being ignored, I'm afraid the results will be flawed! Is there a solution, or is my approach fine as it is?
Answer 0 (score: 1)
This is likely because some line(s) of your file contain data that isn't properly encoded as UTF-8.
If build_vocab() otherwise succeeds, and if the corruption is incidental, rare, or doesn't affect the word vectors you especially care about, it may not make much difference to the end results. (Your sample lines contain no UTF-8 corruption, nor any characters likely to have encoding problems.)
But if it is a problem, you can find the exact source of the error by iterating over sentences yourself, triggering the error outside build_vocab(). For example:
for i, sentence in enumerate(sentences):
    print(i)
Where this stops (if the error ends the iteration), or where the error message appears interleaved with the printed numbers, will hint at which line(s) are problematic. You can examine those lines in a text editor to see which characters are involved. Then, knowing the problem ranges/characters, you could consider deleting/changing those characters, or perhaps try to discover the file's true encoding and re-encode it to UTF-8.
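As a more direct check, a short standalone scan in binary mode (a sketch; the filename comes from the question) can report exactly which lines fail to decode as UTF-8, and where:

# Scan the corpus in binary mode, reporting each line that fails to decode.
with open('ontology_corpus.lst', 'rb') as fin:
    for line_no, raw in enumerate(fin, start=1):
        try:
            raw.decode('utf8')
        except UnicodeDecodeError as err:
            # err.start is the byte offset of the first bad byte in this line
            context = raw[max(0, err.start - 10):err.start + 10]
            print('line %d: %r (%s)' % (line_no, context, err))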
(One other note about your apparent corpus: word-vector training works best when the many varied usage examples of each token are spread throughout the corpus, interleaved with contrasting examples of other tokens. So if your corpus was dumped from some other source that clumps all related tokens, such as <http://purl.obolibrary.org/obo/GO_0090141>, together, you may see an improvement in the final vectors if you shuffle the lines before training; a sketch of such a shuffle follows.)
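A minimal sketch of that shuffle, assuming the corpus fits in memory (the output filename ontology_corpus_shuffled.lst is made up here):

import random

# Write a line-shuffled copy of the corpus to train from.
# Binary mode sidesteps the decoding issue while shuffling.
with open('ontology_corpus.lst', 'rb') as fin:
    lines = fin.readlines()
random.shuffle(lines)
with open('ontology_corpus_shuffled.lst', 'wb') as fout:
    fout.writelines(lines)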