Sentiment analysis using a PySpark DataFrame

Time: 2019-07-02 10:46:40

Tags: python-3.x elasticsearch pyspark apache-spark-sql sentiment-analysis

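I am trying to run sentiment analysis on product reviews we have received. The dataset has 15 columns, and we only need to run the code on one of them. I have working code that generates the sentiment label, polarity, and subjectivity and sends the data to an Elasticsearch index, but it is not feasible for very large data.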

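So, to get near real-time processing on large data, I tried an RDD by converting the data to a new RDD containing only one column ("review body"). With that approach the results cannot be visualized against the product and the other fields. After creating the UDF, I got stuck when trying to apply it to the DataFrame.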

Function to clean the data

import re

def clean_review(text):
    '''
    Clean the review text: keep letters only, lowercase, drop very short words.
    '''
    letters_only = re.sub("[^a-zA-Z]", " ", str(text))  # Replace digits and punctuation with spaces
    words = letters_only.lower().split()  # Lowercase and split into words
    clean_words = [w for w in words if len(w) > 2]  # Drop words of one or two letters
    return " ".join(clean_words)

Function to remove stop words

from nltk.corpus import stopwords  # needs the NLTK "stopwords" corpus to be downloaded

def clean_stop_words(review):
    words = str(review).lower().split()  # Split the sentence into words
    stops = set(stopwords.words("english"))  # English stop words from NLTK
    kept_words = [w for w in words if w not in stops]  # Keep only words that are not stop words
    return " ".join(kept_words)
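
As a quick check of what the two helpers above do when chained, here is a minimal, self-contained sketch. It uses a tiny hard-coded stop-word set (`SAMPLE_STOPS`, an assumption standing in for NLTK's English list) so that it runs without downloading any corpus, and the sample sentence is made up.

```python
import re

# Assumption: a tiny stand-in for nltk's stopwords.words("english")
SAMPLE_STOPS = {"this", "the", "and", "was", "very"}

def clean_review(text):
    # Keep letters only, lowercase, and drop words of one or two letters
    letters_only = re.sub("[^a-zA-Z]", " ", str(text))
    words = letters_only.lower().split()
    return " ".join(w for w in words if len(w) > 2)

def clean_stop_words(review, stops=SAMPLE_STOPS):
    # Keep only the words that are not in the stop-word set
    return " ".join(w for w in str(review).lower().split() if w not in stops)

raw = "This is a GREAT product!!! 10/10, and the battery was very good."
cleaned = clean_stop_words(clean_review(raw))
print(cleaned)
```

With the real NLTK list the result may differ, since that list contains many more words than the sample set.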

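In the pandas DataFrame, the new columns are created and populated, but in PySpark the withColumn function is used differently, so I cannot get this to run.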

data['CleanedReview'] = data['review_body'].apply(clean_review)
data['clean_text'] = data['CleanedReview'].apply(clean_stop_words)

I am stuck on how to apply the following loop to the DataFrame and produce the polarity.

from textblob import TextBlob

for x in range(len(data)):
    analysis = TextBlob(clean_review(str(data['clean_text'][x])))
    #print(analysis.sentiment)
    polarity = analysis.sentiment.polarity
    if polarity < 0:
        sentiment = "negative"
        sentimentValue = -1
    elif polarity == 0:
        sentiment = "neutral"
        sentimentValue = 0
    else:
        sentiment = "positive"
        sentimentValue = 1

    print(sentiment + " " + str(data['clean_text'][x]))

    #es.index(index="sent",
    #         doc_type="test-type",
    #         body={"marketplace": str(data["marketplace"][x]),
    #               "customer_id": str(data["customer_id"][x]),
    #               "review_id": str(data["review_id"][x])  .......n

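One way the loop above might be ported to a PySpark DataFrame is to wrap the scoring in UDFs and attach the results with withColumn, which keeps every other column next to the new ones. This is only a sketch under assumptions, not something run against the real data: `df` is assumed to be a Spark DataFrame with a string column `review_body`, TextBlob is assumed to be installed on every executor, and `label_for`, `polarity_of`, and `add_sentiment_columns` are hypothetical helper names.

```python
def label_for(polarity):
    """Map a polarity score to the (sentiment, sentimentValue) pair used in the loop."""
    if polarity < 0:
        return ("negative", -1)
    if polarity == 0:
        return ("neutral", 0)
    return ("positive", 1)

def polarity_of(text):
    # Imported lazily so the import happens wherever the UDF actually runs
    from textblob import TextBlob
    return float(TextBlob(str(text)).sentiment.polarity)

def add_sentiment_columns(df, text_col="review_body"):
    """Return df with three extra columns; all existing columns are kept."""
    from pyspark.sql import functions as F
    from pyspark.sql.types import DoubleType, IntegerType, StringType

    polarity_udf = F.udf(polarity_of, DoubleType())
    sentiment_udf = F.udf(lambda p: label_for(p)[0], StringType())
    value_udf = F.udf(lambda p: label_for(p)[1], IntegerType())

    return (df
            .withColumn("sentimental_score", polarity_udf(F.col(text_col)))
            .withColumn("sentiment", sentiment_udf(F.col("sentimental_score")))
            .withColumn("sentimentValue", value_udf(F.col("sentimental_score"))))
```

The cleaning functions could be wrapped in string-returning UDFs in the same way and applied before scoring. Python UDFs are row-at-a-time, so this may still be slow on very large data.
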
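The reason I am now trying a DataFrame instead of the RDD is that with the RDD only the review text column can be sent to the Elasticsearch index, and after that the results cannot be visualized against the other related fields. I would be very glad if anyone could help me achieve this. The expected output is an Elasticsearch index in which the following fields are created and populated on top of the code above: {"sentimental_score": analysis.sentiment.polarity, "subjectivity": analysis.sentiment.subjectivity, "sentiment": sentiment, "sentimentValue": sentimentValue}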
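
The expected document can be sketched as a small helper that merges a row's existing fields with the four sentiment fields listed above, so the other columns stay available for visualization. The names `build_sentiment_doc` and `write_to_es` are hypothetical, the sample values are made up, and the write function assumes the elasticsearch-hadoop connector is on the Spark classpath, which the question does not state.

```python
def build_sentiment_doc(row_fields, polarity, subjectivity):
    """Merge a row's existing fields with the four sentiment fields."""
    if polarity < 0:
        sentiment, sentiment_value = "negative", -1
    elif polarity == 0:
        sentiment, sentiment_value = "neutral", 0
    else:
        sentiment, sentiment_value = "positive", 1
    doc = dict(row_fields)  # copy, so the caller's dict is not modified
    doc.update({
        "sentimental_score": polarity,
        "subjectivity": subjectivity,
        "sentiment": sentiment,
        "sentimentValue": sentiment_value,
    })
    return doc

def write_to_es(df, index="sent", nodes="localhost", port="9200"):
    # Assumption: the elasticsearch-hadoop connector jar is available to Spark.
    # Older Elasticsearch versions expect es.resource in "index/type" form.
    (df.write
       .format("org.elasticsearch.spark.sql")
       .option("es.nodes", nodes)
       .option("es.port", port)
       .option("es.resource", index)
       .mode("append")
       .save())

# Made-up sample values, only to show the resulting shape
doc = build_sentiment_doc({"review_id": "R1", "marketplace": "US"}, 0.5, 0.6)
print(doc)
```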

0 answers:

No answers yet