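What is the best way to convert ad hoc categorical features into numeric values within a given range?

Date: 2018-08-14 06:54:09

Tags: python pandas machine-learning nlp sentiment-analysis

I have a list of survey statements, some of which are shown here: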

Don't know / no opinion / blank
Need new products / policies
Lower premiums
Pricing / rates are good / competitive
Problems / complaints should be responded to / resolved faster
Inform / up-date / explain about products / polices (existing new changes)
Other
Good communication / easy to reach / fast response / follow-up
Good overall customer service
Good variety of products / packages / benefits
Should be reachable / available / responsive
Trust / am confident / well established / long history
Improve clarity of information / communications (in general)
Not confident / not well known
Not knowledgeable enough about the product
Advisors should improve clarity of information / communications (in general)
Friends / family do not need - have own provider
Clear / easy to understand
Dont recommend - too risky / dont want to influence
Provide refunds if no claims were made
Improve online services / capabilities
Improve customer service / provide good overall service
Increase awareness / visibility / not well known
Should be more proactive in contacting / communicating with customers
Not knowledgeable enough about the company
Send statements / reports / notices more often
Improve the quality / variety of products offered
Dont recommend - dont discuss topic with others
Need more choices / variety / better selection of products / policies
Good amount of information / advice
Advisors should be more proactive in contacting / communicating with customers
Increase advertising / marketing / promotions
Should contact clients regularly / more often
Service okay / same as other providers
Pricing not competitive / services charges too high
Speed of processing claims
Fund performance / return on investment is good / stable
Good quality of online system / website
Advisors should inform / up-date / explain about products / polices (existing new changes)
Need better quality products / policies / better coverage

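To use these statements in a machine learning model for prediction (xgboost, for example, cannot handle categorical variables), I need to convert them into numeric values within a range, say 0 to 5, where 0 is the score for Don't know / no opinion / blank and the rest are scored from 1 to 5 according to how positive the statement is (1 being the least positive, 5 the most positive).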

Of course, one way is to assign a score to every unique statement by hand. Needless to say, that is very tedious and time-consuming.
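For reference, a minimal sketch of what the manual approach could look like, assuming the scores live in a plain dictionary that is applied with `Series.map`. The dictionary `manual_scores` and the score values in it are hypothetical and only illustrate the idea; they are not taken from the real data set.

```python
import pandas as pd

# Hypothetical hand-assigned scores for a few statements (illustrative values only).
manual_scores = {
    "Don't know / no opinion / blank": 0,
    "Pricing / rates are good / competitive": 5,
    "Good overall customer service": 5,
    "Service okay / same as other providers": 3,
    "Pricing not competitive / services charges too high": 1,
}

statements = pd.Series([
    "Good overall customer service",
    "Don't know / no opinion / blank",
    "Speed of processing claims",  # not scored yet
])

# Statements missing from the dictionary come back as NaN.
scores = statements.map(manual_scores)

# List the statements that still need a hand-assigned score.
unscored = statements[scores.isna()]
print(scores)
print(unscored.tolist())
```

Listing the unscored statements at least makes it visible how much manual work is left, but every unique statement still has to be scored by hand.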

My idea was to run sentiment analysis on each statement and then map the sentiment value to a score in the 0 to 5 range:

from textblob import TextBlob
import pandas as pd

# The CSV contains a column called Statements that holds the statements above
data = pd.read_csv('Dataset_that_contains_these_statements.csv', encoding='ISO-8859-1')
doclist = data['Statements'].unique()


def polarity_to_score(polarity):
    """Map a TextBlob polarity value to a score from 0 to 5."""
    if polarity == 0:
        return 0
    if polarity < 0.1:  # note: this also catches every negative polarity
        return 1
    if polarity < 0.2:
        return 2
    if polarity < 0.3:
        return 3
    if polarity < 0.4:
        return 4
    return 5


# Build the rows first instead of writing into the frame with chained
# indexing (df['col'][i] = ...), which pandas does not reliably support.
rows = []
for statement in doclist:
    sentiment = TextBlob(statement).sentiment
    rows.append({
        'Statement': statement,
        'Polarity': sentiment.polarity,
        'Subjectivity': sentiment.subjectivity,
        'Score': polarity_to_score(sentiment.polarity),
    })

df_new = pd.DataFrame(rows, columns=['Statement', 'Polarity', 'Subjectivity', 'Score'])

print(df_new)

This produces the following output:

    Statement                                               Polarity    Subjectivity    Score

0   Don't know / no opinion / blank                         0           0               0
1   Need new products / policies                            0.136364    0.454545        2
2   Lower premiums                                          0           0               0
3   Pricing / rates are good / competitive                  0.7         0.6             5
4   Problems / complaints should be responded to /...       0           0               0
5   Inform / up-date / explain about products / po...       0.136364    0.454545        2
6   Other                                                   -0.125      0.375           1
7   Good communication / easy to reach / fast resp...       0.444444    0.677778        5
8   Good overall customer service                           0.35        0.3             4
9   Good variety of products / packages / benefits          0.7         0.6             5
10  Should be reachable / available / responsive            0.4         0.4             5
11  Trust / am confident / well established...              0.225       0.616667        3
12  Improve clarity of information / communication...       0.05        0.5             0
13  Not confident in / not well known                      -0.25        0.833333        1
14  Not knowledgeable enough about the product              0           0.5             0
15  Advisors should improve clarity of information...       0.05        0.5             1
16  Friends / family do not need - have own provider        0.6         1               5
17  Clear / easy to understand                              0.266667    0.608333        3
18  Don't recommend - too risky / don't want to in...       0           0               0
19  Provide refunds if no claims were made                  0           0               0
20  Improve online services / capabilities                  0           0               0
21  Improve customer service / provide good overal...       0.35        0.3             2
22  Increase awareness / visibility /...                    0           0               0
23  Should be more proactive in contacting / commu...       0.5         0.5             5
24  Not knowledgeable enough about the company              0           0.5             0
25  Send statements / reports / notices more often          0.5         0.5             5
26  Improve the quality / variety of products offered       0           0               0
27  Don't recommend - don't discuss topic with others       0           0               0
28  Need more choices / variety / better selection...       0.5         0.5             5
29  Good amount of information / advice                     0.7         0.6             5
30  Advisors should be more proactive in contactin...       0.5         0.5             5
31  Increase advertising / marketing / pr...                0           0               0
32  Should contact clients regularly / more often           0.5         0.5             5
33  Service okay / same as other providers                  0.125       0.333333        2
34  Pricing not competitive / services charges too...       0.16        0.54            2
35  Speed of processing claims                              0           0               0
36  Fund performance / return on investment is goo...       0.7         0.6             5
37  Good quality of online system / website                 0.7         0.6             5
38  Advisors should inform / up-date / explain abo...       0.136364    0.454545        2
39  Need better quality products / policies / bett...       0.5         0.5             5

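As you can see, the results are not reliable. For example, Problems / complaints should be responded to /... has a polarity of 0 and therefore a score of 0, whereas it is a negative statement and should score 1 or 2. Another example: Should be more proactive in contacting / commu... is rated as a highly positive statement and therefore gets a score of 5, whereas it is a negative statement and should score 2, and so on.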
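Part of the problem is the thresholds themselves: every negative polarity ends up in score 1 and everything from 0.4 upward in score 5. A minimal sketch of a binning that spreads the full polarity range [-1, 1] evenly over 1 to 5 is shown here. The cut points in `EDGES` and the helper names are assumptions made for illustration, and this does nothing about statements whose polarity is misjudged in the first place.

```python
from bisect import bisect_right

NO_OPINION = "Don't know / no opinion / blank"

# Hypothetical cut points: four edges split [-1, 1] into five equal-width bins.
EDGES = [-0.6, -0.2, 0.2, 0.6]


def polarity_to_score(polarity: float) -> int:
    """Map a polarity in [-1, 1] to 1 (most negative) .. 5 (most positive)."""
    return bisect_right(EDGES, polarity) + 1


def score_statement(statement: str, polarity: float) -> int:
    """Reserve 0 for the no-opinion statement; bin everything else by polarity."""
    if statement == NO_OPINION:
        return 0
    return polarity_to_score(polarity)


print(score_statement(NO_OPINION, 0.0))
print(score_statement("Not confident in / not well known", -0.25))
print(score_statement("Pricing / rates are good / competitive", 0.7))
```

With this scheme a neutral polarity of 0 lands in the middle bin (3) rather than in 0, which is why the no-opinion statement has to be matched explicitly.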

So, how can I automate this encoding?

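PS: I tried one-hot encoding on the statement column, which adds a new feature for each unique statement and assigns it a value of 0 or 1. However, this multiplied the number of features and made some of the visualizations hard to read and interpret. That is why I want to do the above instead. Any ideas?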
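To illustrate the feature growth, here is a minimal sketch of one-hot encoding with `pd.get_dummies` on a small made-up sample; the four rows are an assumption for illustration, not the real data.

```python
import pandas as pd

# Small illustrative sample: four responses, three unique statements.
statements = pd.Series([
    "Good overall customer service",
    "Lower premiums",
    "Good overall customer service",
    "Other",
], name="Statements")

# One indicator column is created per unique statement.
one_hot = pd.get_dummies(statements)

print(one_hot.shape)
print(list(one_hot.columns))
```

The number of columns equals the number of unique statements, so the roughly 40 statements listed above would turn a single column into about 40 features.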

0 answers:

No answers yet