I'm fairly new to NLP in Python, and I'm trying to process a CSV file with spaCy. I can load the file with Pandas, but when I try to run it through spaCy's nlp function, the interpreter errors out roughly 5% of the way through the file's contents.
The code is as follows:
import pandas as pd
import spacy

df = pd.read_csv('./reviews.washington.dc.csv')
nlp = spacy.load('en')

for parsed_doc in nlp.pipe(iter(df['comments']), batch_size=1, n_threads=4):
    print(parsed_doc.text)
I have also tried:
df['parsed'] = df['comments'].apply(nlp)
with the same result.
The traceback I get is:
Traceback (most recent call last):
  File "/Users/john/Downloads/spacy_load.py", line 11, in <module>
    for parsed_doc in nlp.pipe(iter(df['comments']), batch_size=1, n_threads=4):
  File "/usr/local/lib/python3.6/site-packages/spacy/language.py", line 352, in pipe
    for doc in stream:
  File "spacy/syntax/parser.pyx", line 239, in pipe (spacy/syntax/parser.cpp:8912)
  File "spacy/matcher.pyx", line 465, in pipe (spacy/matcher.cpp:9904)
  File "spacy/syntax/parser.pyx", line 239, in pipe (spacy/syntax/parser.cpp:8912)
  File "spacy/tagger.pyx", line 231, in pipe (spacy/tagger.cpp:6548)
  File "/usr/local/lib/python3.6/site-packages/spacy/language.py", line 345, in <genexpr>
    stream = (self.make_doc(text) for text in texts)
  File "/usr/local/lib/python3.6/site-packages/spacy/language.py", line 293, in <lambda>
    self.make_doc = lambda text: self.tokenizer(text)
TypeError: Argument 'string' has incorrect type (expected str, got float)
Can anyone explain why this is happening, and how I can fix it? I've tried various workarounds from around the web, to no avail; try/except blocks have had no effect either.
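For context on the error message: empty cells in a CSV are parsed by pandas as float NaN, which is exactly the "expected str, got float" the tokenizer complains about. A minimal sketch with made-up data (the column names are illustrative, not from the asker's file):

```python
import pandas as pd
from io import StringIO

# Hypothetical CSV where row 2 has an empty "comments" cell.
csv_data = "id,comments\n1,great place\n2,\n3,nice host\n"
df = pd.read_csv(StringIO(csv_data))

# The blank cell comes back as float('nan'), not as a string,
# so any row like this would crash spaCy's tokenizer.
print([type(v).__name__ for v in df['comments']])
# → ['str', 'float', 'str']
```

This also explains why the failure appears only part-way through the file: everything works until the iterator reaches the first row with a missing comment.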
Answer 0 (score: 1)
I just ran into an error very similar to the one you're getting.
>>> c.add_texts(df.DetailedDescription.astype('object'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\Anonymous\AppData\Local\Programs\Python\Python36\lib\site-packages\textacy\corpus.py", line 297, in add_texts
    for i, spacy_doc in enumerate(spacy_docs):
  File "C:\Users\Anonymous\AppData\Local\Programs\Python\Python36\lib\site-packages\spacy\language.py", line 554, in pipe
    for doc in docs:
  File "nn_parser.pyx", line 369, in pipe
  File "cytoolz/itertoolz.pyx", line 1046, in cytoolz.itertoolz.partition_all.__next__ (cytoolz/itertoolz.c:14538)
    for item in self.iterseq:
  File "nn_parser.pyx", line 369, in pipe
  File "cytoolz/itertoolz.pyx", line 1046, in cytoolz.itertoolz.partition_all.__next__ (cytoolz/itertoolz.c:14538)
    for item in self.iterseq:
  File "pipeline.pyx", line 395, in pipe
  File "cytoolz/itertoolz.pyx", line 1046, in cytoolz.itertoolz.partition_all.__next__ (cytoolz/itertoolz.c:14538)
    for item in self.iterseq:
  File "C:\Users\Anonymous\AppData\Local\Programs\Python\Python36\lib\site-packages\spacy\language.py", line 534, in <genexpr>
    docs = (self.make_doc(text) for text in texts)
  File "C:\Users\Anonymous\AppData\Local\Programs\Python\Python36\lib\site-packages\spacy\language.py", line 357, in make_doc
    return self.tokenizer(text)
TypeError: Argument 'string' has incorrect type (expected str, got float)
Eventually I arrived at a solution: use the Pandas DataFrame to cast the values to Unicode, then retrieve them as a native array and feed that to the add_texts method of the textacy Corpus object.
c = textacy.corpus.Corpus(lang='en_core_web_lg')
c.add_texts(df.DetailedDescription.astype('unicode').values)
Doing this allowed me to add all of the text to my corpus, even though I had already tried to force-load the file as Unicode-compliant in the first place (code snippet below, in case it helps anyone else).
import re
import codecs
import pandas as pd
from io import StringIO

# Read the file with replacement of undecodable bytes, strip control and
# non-ASCII bytes, and force the description columns to object dtype.
with codecs.open('Base Data\Base Data.csv', 'r', encoding='utf-8', errors='replace') as base_data:
    df = pd.read_csv(StringIO(re.sub(r'(?!\n)[\x00-\x1F\x80-\xFF]', '', base_data.read())),
                     dtype={"DetailedDescription": object, "OtherDescription": object},
                     na_values=[''])
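For the asker's original spaCy-only pipeline, an equivalent fix is to remove or cast the non-string rows before calling nlp.pipe. A minimal sketch with made-up data (not the asker's actual column values):

```python
import pandas as pd

# Simulate a comments column where one cell was empty in the CSV,
# so pandas parsed it as float('nan').
comments = pd.Series(["great stay", float('nan'), "would book again"])

# Option 1: drop the missing rows entirely.
clean = comments.dropna()
print(list(clean))        # → ['great stay', 'would book again']

# Option 2: cast every value to str (NaN becomes the literal "nan",
# so you may still want to filter those out afterwards).
as_text = comments.astype(str)
print(list(as_text))      # → ['great stay', 'nan', 'would book again']
```

Either variant guarantees that every item handed to the tokenizer is a str, which is all the TypeError above is asking for.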