Gensim sentences from ontology corpus, Unicode error

Date: 2019-05-16 13:55:25

Tags: python unicode gensim word2vec

I'm on Windows 10, using Python 2.7.15 | Anaconda. Whenever I run

import gensim

mymodel = gensim.models.Word2Vec.load(pretrain)
mymodel.min_count = mincount
sentences = gensim.models.word2vec.LineSentence('ontology_corpus.lst')
mymodel.build_vocab(sentences, update=True)  # ERROR HERE ****

I get this error:

Traceback (most recent call last):
  File "runWord2Vec.py", line 23, in <module>
    mymodel.build_vocab(sentences, update=True)
  File "C:xxxx\lib\site-packages\gensim\models\base_any2vec.py", line 936, in build_vocab
    sentences=sentences, corpus_file=corpus_file, progress_per=progress_per, trim_rule=trim_rule)
  File "C:xxxx\lib\site-packages\gensim\models\word2vec.py", line 1591, in scan_vocab
    total_words, corpus_count = self._scan_vocab(sentences, progress_per, trim_rule)
  File "C:xxxxx\lib\site-packages\gensim\models\word2vec.py", line 1560, in _scan_vocab
    for sentence_no, sentence in enumerate(sentences):
  File "C:xxxx\lib\site-packages\gensim\models\word2vec.py", line 1442, in __iter__
    line = utils.to_unicode(line).split()
  File "C:xxxx\lib\site-packages\gensim\utils.py", line 359, in any2unicode
    return unicode(text, encoding, errors=errors)
  File "C:xxxxx\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe6 in position 124: invalid continuation byte

Now, this traces back to the LineSentence class:

class LineSentence(object):

    def __init__(self, source, max_sentence_length=MAX_WORDS_IN_BATCH, limit=None):
        self.source = source
        self.max_sentence_length = max_sentence_length
        self.limit = limit

    def __iter__(self):
        """Iterate through the lines in the source."""
        try:
            # Assume it is a file-like object and try treating it as such
            # Things that don't have seek will trigger an exception
            self.source.seek(0)
            for line in itertools.islice(self.source, self.limit):
                line = utils.to_unicode(line).split()
                i = 0
                while i < len(line):
                    yield line[i: i + self.max_sentence_length]
                    i += self.max_sentence_length
        except AttributeError:
            # If it didn't work like a file, use it as a string filename
            with utils.smart_open(self.source) as fin:
                for line in itertools.islice(fin, self.limit):
                    line = utils.to_unicode(line).split()  # ERROR HERE *************
                    i = 0
                    while i < len(line):
                        yield line[i: i + self.max_sentence_length]
                        i += self.max_sentence_length

In the final call visible in the traceback, I could change the errors parameter to errors='ignore', or change the following line:

 utils.to_unicode(line).split()

to this:

 line.split()
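
For reference, here is a minimal sketch of how the errors='ignore' variant could live in my own code instead of editing gensim's source (the name TolerantLineSentence and the 10000 default, mirroring gensim's MAX_WORDS_IN_BATCH, are my own, not part of gensim):

import itertools
from gensim import utils

class TolerantLineSentence(object):
    """Like gensim's LineSentence, but silently drops undecodable bytes."""

    def __init__(self, source, max_sentence_length=10000, limit=None):
        self.source = source
        self.max_sentence_length = max_sentence_length
        self.limit = limit

    def __iter__(self):
        # Treat source as a filename; errors='ignore' makes to_unicode
        # discard any bytes that are not valid UTF-8 instead of raising
        with utils.smart_open(self.source) as fin:
            for line in itertools.islice(fin, self.limit):
                line = utils.to_unicode(line, errors='ignore').split()
                i = 0
                while i < len(line):
                    yield line[i: i + self.max_sentence_length]
                    i += self.max_sentence_length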

A sample of the ontology_corpus.lst file:

<http://purl.obolibrary.org/obo/GO_0090141> EquivalentTo <http://purl.obolibrary.org/obo/GO_0065007> and  <http://purl.obolibrary.org/obo/RO_0002213> some <http://purl.obolibrary.org/obo/GO_0000266> 
<http://purl.obolibrary.org/obo/GO_0090141> SubClassOf <http://purl.obolibrary.org/obo/GO_0065007>

The problem is that it runs this way, but I'm afraid the results will be flawed because the encoding errors are being ignored! Is there a proper solution, or will my workaround be fine?

1 answer:

Answer 0 (score: 1):

This most likely means that some line (or lines) of your file contain data that isn't properly encoded as UTF-8.
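
For illustration, here is the same failure in miniature (a made-up byte string, not from your file): 0xe6 starts a 3-byte UTF-8 sequence, so it must be followed by continuation bytes in the 0x80-0xBF range, and decoding fails when it isn't:

b'\xe6A'.decode('utf-8')
# UnicodeDecodeError: 'utf8' codec can't decode byte 0xe6 in position 0:
# invalid continuation byte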

If build_vocab() otherwise succeeds, this corruption may not have much effect on your final results, as long as it is inadvertent, rare, or doesn't affect the word vectors you particularly care about. (Your example lines contain no UTF-8 corruption, nor any characters likely to have encoding issues.)

But if it is a concern, you could try to find the exact source of the problem by iterating over sentences yourself, triggering the error outside of build_vocab(). For example:

for i, sentence in enumerate(sentences):
    print(i)

Where this stops (if the error ends the iteration), or how the error message interleaves with the printed numbers, will hint at which line(s) are problematic. You can then examine those lines in a text editor to see which characters are involved. With that knowledge of the affected ranges/characters, you could consider deleting/changing those characters, or try to discover the file's true encoding and re-encode it to UTF-8.
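
One way to do that scan is directly on the raw bytes, so a single bad line doesn't abort the pass; a sketch (the filename is the one from your question, the reporting format is just a suggestion):

with open('ontology_corpus.lst', 'rb') as fin:
    for lineno, raw in enumerate(fin, start=1):
        try:
            raw.decode('utf-8')
        except UnicodeDecodeError as err:
            # err names the exact byte and offset, as in your traceback
            print(lineno, err)

Once you know the file's true encoding, re-encoding can be as simple as reading it with io.open(path, encoding=...) and writing it back out with encoding='utf-8'.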

(One other note about your apparent corpus: word-vector training works best when the many varied usage examples of a single token are spread throughout the corpus, interleaved with contrasting examples of other tokens. So if your corpus is a dump from some other source that clusters all related tokens together, such as every line mentioning <http://purl.obolibrary.org/obo/GO_0090141>, you may get improved final vectors if you shuffle the lines before training.)
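
A minimal sketch of that shuffle, assuming the corpus fits in memory (the output filename is arbitrary; binary mode avoids tripping the same decode error while shuffling):

import random

with open('ontology_corpus.lst', 'rb') as fin:
    lines = fin.readlines()
random.shuffle(lines)
with open('ontology_corpus_shuffled.lst', 'wb') as fout:
    fout.writelines(lines)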