How do I initialize a `Doc` in textacy 0.6.2?

Time: 2018-07-19 20:21:06

Tags: python nlp textacy

Trying to follow the simple Doc initialization in the docs does not work in Python 2:

>>> import textacy
>>> content = '''
...     The apparent symmetry between the quark and lepton families of
...     the Standard Model (SM) are, at the very least, suggestive of
...     a more fundamental relationship between them. In some Beyond the
...     Standard Model theories, such interactions are mediated by
...     leptoquarks (LQs): hypothetical color-triplet bosons with both
...     lepton and baryon number and fractional electric charge.'''
>>> metadata = {
...     'title': 'A Search for 2nd-generation Leptoquarks at √s = 7 TeV',
...     'author': 'Burton DeWilde',
...     'pub_date': '2012-08-01'}
>>> doc = textacy.Doc(content, metadata=metadata)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/a/anaconda/envs/env1/lib/python2.7/site-packages/textacy/doc.py", line 120, in __init__
    {compat.unicode_, SpacyDoc}, type(content)))
ValueError: `Doc` must be initialized with set([<type 'unicode'>, <type 'spacy.tokens.doc.Doc'>]) content, not "<type 'str'>"

What should a simple initialization look like for a string or a sequence of strings?

Update

Passing unicode(content) to textacy.Doc() raises

ImportError: 'cld2-cffi' must be installed to use textacy's automatic language detection; you may do so via 'pip install cld2-cffi' or 'pip install textacy[lang]'.

It would have been nice to have that from the moment textacy was installed.

Even after installing cld2-cffi, trying the code above raises

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/a/anaconda/envs/env1/lib/python2.7/site-packages/textacy/doc.py", line 114, in __init__
    self._init_from_text(content, metadata, lang)
  File "/Users/a/anaconda/envs/env1/lib/python2.7/site-packages/textacy/doc.py", line 136, in _init_from_text
    spacy_lang = cache.load_spacy(langstr)
  File "/Users/a/anaconda/envs/env1/lib/python2.7/site-packages/cachetools/__init__.py", line 46, in wrapper
    v = func(*args, **kwargs)
  File "/Users/a/anaconda/envs/env1/lib/python2.7/site-packages/textacy/cache.py", line 99, in load_spacy
    return spacy.load(name, disable=disable)
  File "/Users/a/anaconda/envs/env1/lib/python2.7/site-packages/spacy/__init__.py", line 21, in load
    return util.load_model(name, **overrides)
  File "/Users/a/anaconda/envs/env1/lib/python2.7/site-packages/spacy/util.py", line 120, in load_model
    raise IOError("Can't find model '%s'" % name)
IOError: Can't find model 'en'

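2 answers:

Answer 0 (score: 1)

As the traceback shows, the problem is in the _init_from_text() function in textacy/doc.py, which tries to detect the language and, on line 136, loads a spacy model using the string 'en'. (The spacy repo touches on this in this issue comment.)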

I solved this by passing a valid model name as a unicode string, u'en_core_web_sm', for lang, and by using unicode strings for both the content and lang arguments.

import textacy
content = u'''
    The apparent symmetry between the quark and lepton families of
    the Standard Model (SM) are, at the very least, suggestive of
    a more fundamental relationship between them. In some Beyond the
    Standard Model theories, such interactions are mediated by
    leptoquarks (LQs): hypothetical color-triplet bosons with both
    lepton and baryon number and fractional electric charge.'''
metadata = {
    'title': 'A Search for 2nd-generation Leptoquarks at √s = 7 TeV',
    'author': 'Burton DeWilde',
    'pub_date': '2012-08-01'}
doc = textacy.Doc(content, metadata=metadata, lang=u'en_core_web_sm')

Several things here look like bugs to me: passing a str instead of a unicode string changes the behavior (with a confusing error message), a required package is missing, and the way spacy language strings are used may be outdated or incomplete.
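To avoid hitting the IOError from the question at Doc creation time, you could check up front whether the model package is importable. This is only a rough sketch: `spacy_model_installed` is a hypothetical helper, not part of textacy or spacy, and it assumes the model was installed as a pip package (such as en_core_web_sm). It will not detect shortcut links like 'en'.

```python
import importlib


def spacy_model_installed(package_name):
    """Return True if `package_name` can be imported as a Python package.

    Assumption: the model was installed as a pip package
    (for example en_core_web_sm). Shortcut links are not detected.
    """
    try:
        importlib.import_module(package_name)
    except ImportError:
        return False
    return True


lang = u'en_core_web_sm'
if not spacy_model_installed(lang):
    # Report the problem before textacy.Doc() fails deep inside spacy.load().
    print(u"Model %s is missing; try: python -m spacy download %s" % (lang, lang))
```

If the check fails, installing the model with `python -m spacy download en_core_web_sm` and then passing lang=u'en_core_web_sm' as above should be enough.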

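Answer 1 (score: 0)

It looks like you are using Python 2 and running into a unicode error. The textacy docs include a note about some unicode nuances when using Python 2:

  

Note: In almost all cases, textacy (as well as spacy) expects to be working with unicode text data. Throughout the code, this is indicated as str to be consistent with Python 3's default string type; users of Python 2, however, must be mindful to use unicode, and convert from the default (bytes) string type as needed.

So I would give this a try (note the u'''):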

content = u'''
          The apparent symmetry between the quark and lepton families of
          the Standard Model (SM) are, at the very least, suggestive of
          a more fundamental relationship between them. In some Beyond the
          Standard Model theories, such interactions are mediated by
          leptoquarks (LQs): hypothetical color-triplet bosons with both
          lepton and baryon number and fractional electric charge.'''

This produced the Doc object I expected (on Python 3, though).
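When the text does not come from a literal (for example, when it is read from a file or a socket), the u''' prefix is not an option, and the conversion the note describes has to be done explicitly. Below is a minimal sketch that runs on both Python 2 and Python 3. `to_text` and `text_type` are hypothetical names, not part of textacy, and UTF-8 is an assumed encoding. A bare unicode(content) on Python 2 decodes with the ASCII codec and would fail on non-ASCII bytes such as the √ in the metadata title.

```python
import sys

# `unicode` exists only on Python 2; on Python 3 the text type is `str`.
text_type = unicode if sys.version_info[0] == 2 else str


def to_text(value, encoding='utf-8'):
    """Return `value` as unicode text, decoding byte strings explicitly."""
    if isinstance(value, text_type):
        return value  # already unicode text, nothing to do
    if isinstance(value, bytes):
        # On Python 2, `bytes` is the default `str` type.
        return value.decode(encoding)
    raise TypeError('expected text or bytes, got %r' % type(value))


content = to_text(b'leptoquarks (LQs): hypothetical color-triplet bosons')
```

The resulting `content` can then be passed to textacy.Doc() in place of the u''' literal.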