I have a list of survey statements, some of which are shown here:
Don't know / no opinion / blank
Need new products / policies
Lower premiums
Pricing / rates are good / competitive
Problems / complaints should be responded to / resolved faster
Inform / up-date / explain about products / polices (existing new changes)
Other
Good communication / easy to reach / fast response / follow-up
Good overall customer service
Good variety of products / packages / benefits
Should be reachable / available / responsive
Trust / am confident / well established / long history
Improve clarity of information / communications (in general)
Not confident / not well known
Not knowledgeable enough about the product
Advisors should improve clarity of information / communications (in general)
Friends / family do not need - have own provider
Clear / easy to understand
Dont recommend - too risky / dont want to influence
Provide refunds if no claims were made
Improve online services / capabilities
Improve customer service / provide good overall service
Increase awareness / visibility / not well known
Should be more proactive in contacting / communicating with customers
Not knowledgeable enough about the company
Send statements / reports / notices more often
Improve the quality / variety of products offered
Dont recommend - dont discuss topic with others
Need more choices / variety / better selection of products / policies
Good amount of information / advice
Advisors should be more proactive in contacting / communicating with customers
Increase advertising / marketing / promotions
Should contact clients regularly / more often
Service okay / same as other providers
Pricing not competitive / services charges too high
Speed of processing claims
Fund performance / return on investment is good / stable
Good quality of online system / website
Advisors should inform / up-date / explain about products / polices (existing new changes)
Need better quality products / policies / better coverage
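To use these in a machine learning model for prediction (xgboost, for example, cannot handle categorical variables), I need to convert them to numeric values within a fixed range, say 0 to 5, where 0 is the score for Don't know / no opinion / blank and the remaining statements are scored from 1 to 5 according to how positive they are (1 being the least positive, 5 the most positive).
One option, of course, is to assign a score to every unique statement by hand. Needless to say, that is very tedious and time-consuming.
Instead, I tried running sentiment analysis on each statement and then mapping the sentiment value to a score in the 0 to 5 range: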
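For reference, the manual approach would look roughly like the sketch below. The first three scores follow the intent described in this question; the score for the pricing statement and the toy DataFrame are assumptions made up for illustration.

```python
import pandas as pd

# Hand-assigned scores. The value 5 for the pricing statement is a
# hypothetical choice; the others follow the intent stated in the question.
manual_scores = {
    "Don't know / no opinion / blank": 0,
    "Problems / complaints should be responded to / resolved faster": 2,
    "Should be more proactive in contacting / communicating with customers": 2,
    "Pricing / rates are good / competitive": 5,
}

# Toy stand-in for the real Statements column
data = pd.DataFrame({"Statements": [
    "Pricing / rates are good / competitive",
    "Don't know / no opinion / blank",
    "Lower premiums",  # deliberately left out of the dict
]})

# Statements without a manual score become NaN, so gaps are easy to spot
data["Score"] = data["Statements"].map(manual_scores)
print(data)
```

Every new unique statement needs a new dictionary entry, which is exactly why this does not scale.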
from textblob import TextBlob as tb
import pandas as pd
import numpy as np

# The CSV contains a column called Statements that holds the statements above
data = pd.read_csv('Dataset_that_contains_these_statements.csv', encoding='ISO-8859-1')
doclist = data['Statements'].unique()
df_new = pd.DataFrame(index=np.arange(len(doclist)), columns=['Statement', 'Polarity', 'Subjectivity', 'Score'])
for i in range(len(doclist)):
    blob = tb(doclist[i])
    polarity = blob.sentiment.polarity
    # .loc avoids chained assignment, which may silently fail to write
    df_new.loc[i, 'Statement'] = doclist[i]
    df_new.loc[i, 'Polarity'] = polarity
    df_new.loc[i, 'Subjectivity'] = blob.sentiment.subjectivity
    # Map the polarity to a score from 0 to 5
    if polarity == 0:
        df_new.loc[i, 'Score'] = 0
    elif polarity < 0.1:  # nonzero, including all negative values
        df_new.loc[i, 'Score'] = 1
    elif polarity < 0.2:
        df_new.loc[i, 'Score'] = 2
    elif polarity < 0.3:
        df_new.loc[i, 'Score'] = 3
    elif polarity < 0.4:
        df_new.loc[i, 'Score'] = 4
    else:
        df_new.loc[i, 'Score'] = 5
print(df_new)
This produces the following output:
Statement Polarity Subjectivity Score
0 Don't know / no opinion / blank 0 0 0
1 Need new products / policies 0.136364 0.454545 2
2 Lower premiums 0 0 0
3 Pricing / rates are good / competitive 0.7 0.6 5
4 Problems / complaints should be responded to /... 0 0 0
5 Inform / up-date / explain about products / po... 0.136364 0.454545 2
6 Other -0.125 0.375 1
7 Good communication / easy to reach / fast resp... 0.444444 0.677778 5
8 Good overall customer service 0.35 0.3 4
9 Good variety of products / packages / benefits 0.7 0.6 5
10 Should be reachable / available / responsive 0.4 0.4 5
11 Trust / am confident / well established... 0.225 0.616667 3
12 Improve clarity of information / communication... 0.05 0.5 0
13 Not confident in / not well known -0.25 0.833333 1
14 Not knowledgeable enough about the product 0 0.5 0
15 Advisors should improve clarity of information... 0.05 0.5 1
16 Friends / family do not need - have own provider 0.6 1 5
17 Clear / easy to understand 0.266667 0.608333 3
18 Don't recommend - too risky / don't want to in... 0 0 0
19 Provide refunds if no claims were made 0 0 0
20 Improve online services / capabilities 0 0 0
21 Improve customer service / provide good overal... 0.35 0.3 2
22 Increase awareness / visibility /... 0 0 0
23 Should be more proactive in contacting / commu... 0.5 0.5 5
24 Not knowledgeable enough about the company 0 0.5 0
25 Send statements / reports / notices more often 0.5 0.5 5
26 Improve the quality / variety of products offered 0 0 0
27 Don't recommend - don't discuss topic with others 0 0 0
28 Need more choices / variety / better selection... 0.5 0.5 5
29 Good amount of information / advice 0.7 0.6 5
30 Advisors should be more proactive in contactin... 0.5 0.5 5
31 Increase advertising / marketing / pr... 0 0 0
32 Should contact clients regularly / more often 0.5 0.5 5
33 Service okay / same as other providers 0.125 0.333333 2
34 Pricing not competitive / services charges too... 0.16 0.54 2
35 Speed of processing claims 0 0 0
36 Fund performance / return on investment is goo... 0.7 0.6 5
37 Good quality of online system / website 0.7 0.6 5
38 Advisors should inform / up-date / explain abo... 0.136364 0.454545 2
39 Need better quality products / policies / bett... 0.5 0.5 5
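As you can see, the results are not reliable. For example, Problems / complaints should be responded to /...
has a polarity of 0 and therefore a score of 0, whereas it is really a negative statement and should score 1 or 2. Another example: Should be more proactive in contacting / commu...
is rated as a highly positive statement and gets a 5, whereas it is a negative statement that should get a 2, and so on.
So, how can I automate this encoding?
PS: I did try one-hot encoding on the statement column, which adds a new feature for each unique statement and assigns it a value of 0 or 1. However, this multiplied the number of features and made some of my visualizations hard to read and interpret. That is why I want to do the above. Any ideas?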
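For reference, the thresholds in the loop above can be pulled out into a small standalone function, which makes the binning easier to inspect separately from the polarity values TextBlob returns. This is only a sketch; the name polarity_to_score is made up for illustration.

```python
def polarity_to_score(polarity: float) -> int:
    """Map a polarity value to a 0-5 score using the thresholds from the loop above."""
    if polarity == 0:
        return 0
    if polarity < 0.1:  # nonzero, so this also catches every negative polarity
        return 1
    if polarity < 0.2:
        return 2
    if polarity < 0.3:
        return 3
    if polarity < 0.4:
        return 4
    return 5

# Polarity values taken from the output table above
for p in (0.0, -0.125, 0.136364, 0.225, 0.35, 0.7):
    print(p, polarity_to_score(p))
```

Note that the binning puts every negative polarity into score 1 and every exact zero into score 0, so a statement that TextBlob scores as 0 is indistinguishable from the blank / no opinion case.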
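To illustrate the one-hot approach mentioned in the PS, here is a minimal sketch using pandas. The toy DataFrame and the stmt prefix are assumptions for illustration, not my actual data.

```python
import pandas as pd

# Toy stand-in for the real Statements column
df = pd.DataFrame({"Statements": [
    "Good overall customer service",
    "Other",
    "Speed of processing claims",
    "Other",
]})

# One new 0/1 column per unique statement
one_hot = pd.get_dummies(df["Statements"], prefix="stmt", dtype=int)
print(one_hot.shape)
print(one_hot.columns.tolist())
```

With only three unique statements this already yields three columns; with the full list above, the statement column alone turns into dozens of features.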