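I am trying to run sentiment analysis on the product reviews we receive. The dataset has 15 columns, but we can only run the code on 1 of them. I have working code that generates the sentiment value, polarity and subjectivity and sends the data to an Elasticsearch index, but it is not feasible for huge data.
So, to get this working in real time on huge data, I tried it with an RDD by converting the data into a new RDD made up of only 1 column ("review body"). With that approach I could not visualize the results against the product and the other fields. After creating a UDF, I get stuck when I try to apply it to the DataFrame.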
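One possible way around losing the other columns is to transform whole records instead of a single-column RDD. The following is only a minimal sketch in plain Python under stated assumptions: the function name `with_clean_review` and the sample record are hypothetical, and only the `review_body`, `review_id` and `customer_id` field names come from the question.

```python
import re
from typing import Any, Dict, Mapping


def with_clean_review(record: Mapping[str, Any]) -> Dict[str, Any]:
    """Return a copy of the record with an added 'clean_text' field.

    Hypothetical helper: it keeps every original column and only adds one.
    """
    # Replace everything except letters with spaces
    letters_only = re.sub("[^a-zA-Z]", " ", str(record.get("review_body", "")))
    # Lower-case, split into words, and drop 1- and 2-letter words
    words = [w for w in letters_only.lower().split() if len(w) > 2]
    enriched = dict(record)  # copy, so the input record is not mutated
    enriched["clean_text"] = " ".join(words)
    return enriched


if __name__ == "__main__":
    # Made-up sample record, for illustration only
    sample = {"review_id": "R1", "customer_id": 123, "review_body": "It is GREAT, 10/10!"}
    print(with_clean_review(sample))
```

Because the function takes and returns a full record, a function of this shape could presumably be mapped over the rows of a Spark DataFrame (for example over `row.asDict()`), so the product and customer columns would still be available for visualization afterwards.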

    import re
    from nltk.corpus import stopwords
    from textblob import TextBlob

    # `data` is assumed to be a pandas DataFrame that has a `review_body` column

    def clean_review(text):
        '''
        This function cleans the data.
        '''
        letters_only = re.sub("[^a-zA-Z]", " ", str(text))  # Replace everything except letters with spaces
        words = letters_only.lower().split()  # Convert to lower case and split into words
        clean_words = [w for w in words if len(w) > 2]  # Remove 1- and 2-letter words
        return " ".join(clean_words)

    def clean_stop_words(review):
        words = str(review).lower().split()  # Split the sentence into words
        stops = set(stopwords.words("english"))  # Stop words from nltk
        kept_words = [w for w in words if w not in stops]  # Keep only the words that are not stop words
        return " ".join(kept_words)

    # Store the cleaned text as new columns on the DataFrame, not on the review_body Series
    data['CleanedReview'] = data.review_body.apply(clean_review)
    data['clean_text'] = data['CleanedReview'].apply(clean_stop_words)

    for x in range(len(data)):
        analysis = TextBlob(clean_review(str(data['clean_text'][x])))
        # print(analysis.sentiment)
        if analysis.sentiment.polarity < 0:
            sentiment = "negative"
            sentimentValue = -1
        elif analysis.sentiment.polarity == 0:
            sentiment = "neutral"
            sentimentValue = 0
        else:
            sentiment = "positive"
            sentimentValue = 1
        print(sentiment + " " + str(data['clean_text'][x]))
        # es.index(index="sent",
        #          doc_type="test-type",
        #          body={"marketplace": str(data["marketplace"][x]),
        #                "customer_id": str(data["customer_id"][x]),
        #                "review_id": str(data["review_id"][x]) .......n

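The reason I am now trying a DataFrame instead of the RDD is that with the RDD I could only send the review column to the Elasticsearch index, and after that I could not build visualizations against the other related fields. I would be very glad for anything that helps me achieve this. The expected output is an Elasticsearch index in which, in addition to the fields in the code above, the following fields are created and populated:

    {"sentimental_score": analysis.sentiment.polarity, "subjectivity": analysis.sentiment.subjectivity, "sentiment": sentiment, "sentimentValue": sentimentValue}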
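To make the expected document concrete, here is a minimal sketch of how one row and its sentiment scores might be merged into a single dictionary before indexing. The function name `build_es_doc` and the sample values are hypothetical. The polarity and subjectivity are passed in as plain numbers so that the sketch does not depend on TextBlob or on an Elasticsearch client.

```python
from typing import Any, Dict, Mapping


def build_es_doc(row: Mapping[str, Any], polarity: float, subjectivity: float) -> Dict[str, Any]:
    """Merge all original columns with the four sentiment fields (hypothetical helper)."""
    # Keep every original column, stringified as in the question's es.index body
    doc: Dict[str, Any] = {key: str(value) for key, value in row.items()}

    # Same thresholds as the loop in the question
    if polarity < 0:
        sentiment, sentiment_value = "negative", -1
    elif polarity == 0:
        sentiment, sentiment_value = "neutral", 0
    else:
        sentiment, sentiment_value = "positive", 1

    doc.update({
        "sentimental_score": polarity,
        "subjectivity": subjectivity,
        "sentiment": sentiment,
        "sentimentValue": sentiment_value,
    })
    return doc


if __name__ == "__main__":
    # Made-up row and scores, for illustration only
    sample_row = {"marketplace": "US", "customer_id": 123, "review_id": "R1"}
    print(build_es_doc(sample_row, polarity=0.8, subjectivity=0.75))
```

Since every original column is carried into the document, the index would hold the product and customer fields next to the sentiment fields, which is what the visualizations need. Whether this per-row approach is fast enough for the full dataset is something the sketch does not address.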