Python SpaCy creating nlp docs - Argument 'string' has incorrect type

Date: 2017-08-18 04:10:56

Tags: python nlp spacy

I'm fairly new to NLP with Python and am trying to process a CSV file with SpaCy. I can load the file fine using Pandas, but when I try to process it with SpaCy's nlp function, it errors out roughly 5% of the way through the file's contents.

The code block is as follows:

import pandas as pd
df = pd.read_csv('./reviews.washington.dc.csv')

import spacy
nlp = spacy.load('en')

for parsed_doc in nlp.pipe(iter(df['comments']), batch_size=1, n_threads=4):
    print(parsed_doc.text)

I have also tried:

df['parsed'] = df['comments'].apply(nlp)

with the same result.

The traceback I receive is:

Traceback (most recent call last):
  File "/Users/john/Downloads/spacy_load.py", line 11, in <module>
    for parsed_doc in nlp.pipe(iter(df['comments']), batch_size=1, n_threads=4):
  File "/usr/local/lib/python3.6/site-packages/spacy/language.py", line 352, in pipe
    for doc in stream:
  File "spacy/syntax/parser.pyx", line 239, in pipe (spacy/syntax/parser.cpp:8912)
  File "spacy/matcher.pyx", line 465, in pipe (spacy/matcher.cpp:9904)
  File "spacy/syntax/parser.pyx", line 239, in pipe (spacy/syntax/parser.cpp:8912)
  File "spacy/tagger.pyx", line 231, in pipe (spacy/tagger.cpp:6548)
  File "/usr/local/lib/python3.6/site-packages/spacy/language.py", line 345, in <genexpr>
    stream = (self.make_doc(text) for text in texts)
  File "/usr/local/lib/python3.6/site-packages/spacy/language.py", line 293, in <lambda>
    self.make_doc = lambda text: self.tokenizer(text)
TypeError: Argument 'string' has incorrect type (expected str, got float)

Can anyone explain why this is happening, and how I can work around it? I've tried various workarounds from this site, to no avail. Try/except blocks have had no effect either.
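
For reference, a quick way to see exactly which values pandas loaded as something other than str (a minimal sketch against the same DataFrame; the NaN floats that pandas substitutes for empty cells are the usual culprit):

import pandas as pd

df = pd.read_csv('./reviews.washington.dc.csv')

# Rows whose 'comments' value is not a str -- typically the NaN floats
# that pandas substitutes for empty cells in the CSV.
bad_rows = df[~df['comments'].apply(lambda v: isinstance(v, str))]
print(bad_rows['comments'])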

1 Answer:

Answer 0 (score: 1)

I just ran into an error very similar to the one you're getting.

>>> c.add_texts(df.DetailedDescription.astype('object'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\Anonymous\AppData\Local\Programs\Python\Python36\lib\site-packages\textacy\corpus.py", line 297, in add_texts
    for i, spacy_doc in enumerate(spacy_docs):
  File "C:\Users\Anonymous\AppData\Local\Programs\Python\Python36\lib\site-packages\spacy\language.py", line 554, in pipe
    for doc in docs:
  File "nn_parser.pyx", line 369, in pipe
  File "cytoolz/itertoolz.pyx", line 1046, in cytoolz.itertoolz.partition_all.__next__ (cytoolz/itertoolz.c:14538)
    for item in self.iterseq:
  File "nn_parser.pyx", line 369, in pipe
  File "cytoolz/itertoolz.pyx", line 1046, in cytoolz.itertoolz.partition_all.__next__ (cytoolz/itertoolz.c:14538)
    for item in self.iterseq:
  File "pipeline.pyx", line 395, in pipe
  File "cytoolz/itertoolz.pyx", line 1046, in cytoolz.itertoolz.partition_all.__next__ (cytoolz/itertoolz.c:14538)
    for item in self.iterseq:
  File "C:\Users\Anonymous\AppData\Local\Programs\Python\Python36\lib\site-packages\spacy\language.py", line 534, in <genexpr>
    docs = (self.make_doc(text) for text in texts)
  File "C:\Users\Anonymous\AppData\Local\Programs\Python\Python36\lib\site-packages\spacy\language.py", line 357, in make_doc
    return self.tokenizer(text)
TypeError: Argument 'string' has incorrect type (expected str, got float)

Eventually I landed on a solution: use the Pandas DataFrame to convert the values to Unicode, then retrieve the values as a native array and feed them to the add_texts method of a textacy Corpus object:

import textacy

c = textacy.corpus.Corpus(lang='en_core_web_lg')
c.add_texts(df.DetailedDescription.astype('unicode').values)
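
The cast works because pandas stores empty cells as the float NaN; astype('unicode') turns each of those into the string 'nan', so the tokenizer only ever receives str values. Applied to the column from the original question, a minimal sketch (assuming rows with no comment can simply be dropped rather than kept as the literal text 'nan'):

import pandas as pd
import spacy

nlp = spacy.load('en')
df = pd.read_csv('./reviews.washington.dc.csv')

# Drop the NaN rows, then force everything that remains to a unicode str.
comments = df['comments'].dropna().astype('unicode').values

for parsed_doc in nlp.pipe(iter(comments), batch_size=1, n_threads=4):
    print(parsed_doc.text)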

Doing this allowed me to add all of the texts to my corpus, even though I had already tried to force the file to load as Unicode-compliant up front (snippet below, in case it helps anyone else):

import codecs
import re
from io import StringIO

import pandas as pd

# Strip control characters and high-bit bytes before handing the text to pandas.
with codecs.open(r'Base Data\Base Data.csv', 'r', encoding='utf-8', errors='replace') as base_data:
    df = pd.read_csv(StringIO(re.sub(r'(?!\n)[\x00-\x1F\x80-\xFF]', '', base_data.read())),
                     dtype={"DetailedDescription": object, "OtherDescription": object},
                     na_values=[''])
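
Note that encoding-focused preprocessing like the above cannot prevent this particular TypeError on its own: na_values=[''] explicitly tells pandas to turn empty fields into NaN floats, which is exactly the non-str type the tokenizer rejects. An alternative sketch (assuming empty cells are the only non-string values in the column) is to have pandas keep them as empty strings instead:

import pandas as pd

# keep_default_na=False makes pandas leave empty fields as '' rather than
# NaN, so every value in the column is already a str when spaCy sees it.
df = pd.read_csv(r'Base Data\Base Data.csv',
                 dtype={"DetailedDescription": object, "OtherDescription": object},
                 keep_default_na=False)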