我想创建一个LDA主题模型,并按照教程使用SpaCy这样做。我尝试使用spacy时收到的错误是我在google上找不到的错误,所以我希望有人在这里知道它是什么。
我在Anaconda上运行此代码:
import numpy as np
import pandas as pd
import re, nltk, spacy, gensim
# Sklearn
from sklearn.decomposition import LatentDirichletAllocation, TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from pprint import pprint
# Plotting tools
import pyLDAvis
import pyLDAvis.sklearn
import matplotlib.pyplot as plt
df = pd.DataFrame(data)
def sent_to_words(sentences):
for sentence in sentences:
yield(gensim.utils.simple_preprocess(str(sentence), deacc=True))
# deacc=True removes punctuations
data_words = list(sent_to_words(data))
print(data_words[:1])
def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
"""https://spacy.io/api/annotation"""
texts_out = []
for sent in texts:
doc = nlp(" ".join(sent))
texts_out.append(" ".join([token.lemma_ if token.lemma_ not in ['-PRON-'] else '' for token in doc if token.pos_ in allowed_postags]))
return texts_out
nlp = spacy.load('en', disable=['parser', 'ner'])
# Do lemmatization keeping only Noun, Adj, Verb, Adverb
data_lemmatized = lemmatization(data_words, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])
print(data_lemmatized[:1])
我收到以下错误:
File "C:\Users\maart\AppData\Local\Continuum\anaconda3\lib\site-packages\_regex_core.py", line 1880, in get_firstset
raise _FirstSetError()
_FirstSetError
错误必须在词形还原后的某处发生,因为其他部分工作正常。
非常感谢!
答案 0 :(得分:0)
我也遇到了同样的问题,可以通过卸载regex(我安装了错误的版本)并再次运行class Point
{
public:
// Declare prefix and postfix increment operators.
Point& operator++(); // Prefix increment operator.
Point operator++(int); // Postfix increment operator.
private:
int x;
};
// Define prefix increment operator.
Point& Point::operator++()
{
x++;
return *this;
}
// Define postfix increment operator.
Point Point::operator++(int)
{
Point temp = *this;
++*this;
return temp;
}
来解决。这将重新安装正确版本的regex。