I have to classify some sentiments. My data frame looks like this:
Phrase Sentiment
is it good movie positive
wooow is it very goode positive
bad movie negative
I did some preprocessing (tokenization, stop word removal, and so on) and got:
Phrase Sentiment
[ good , movie ] positive
[wooow ,is , it ,very, good ] positive
[bad , movie ] negative
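I need to end up with a data frame in which each row is a text, the values are the tf_idf scores, and the columns are the words, like this:
good movie wooow very bad Sentiment
tf_idf tf_idf tf_idf tf_idf tf_idf positive
(and the same for the remaining two rows)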
Answer 0 (score: 6)
I would use sklearn.feature_extraction.text.TfidfVectorizer, which is designed specifically for this kind of task:
Demo:
In [63]: df
Out[63]:
Phrase Sentiment
0 is it good movie positive
1 wooow is it very goode positive
2 bad movie negative
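Solution:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# max_df=0.5 drops terms found in more than half of the documents ('movie' here)
vect = TfidfVectorizer(sublinear_tf=True, max_df=0.5, analyzer='word', stop_words='english')
# pop() removes 'Phrase' from df so the raw text is not kept around
X = vect.fit_transform(df.pop('Phrase')).toarray()
r = df[['Sentiment']].copy()
del df
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
df = pd.DataFrame(X, columns=vect.get_feature_names_out())
del X
del vect
r.join(df)
Result: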
In [31]: r.join(df)
Out[31]:
Sentiment bad good goode wooow
0 positive 0.0 1.0 0.000000 0.000000
1 positive 0.0 0.0 0.707107 0.707107
2 negative 1.0 0.0 0.000000 0.000000
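Update: a memory-saving solution:
from sklearn.feature_extraction.text import TfidfVectorizer
vect = TfidfVectorizer(sublinear_tf=True, max_df=0.5, analyzer='word', stop_words='english')
X = vect.fit_transform(df.pop('Phrase')).toarray()
# add one column per term to df directly, instead of building a second DataFrame
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
for i, col in enumerate(vect.get_feature_names_out()):
    df[col] = X[:, i]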
UPDATE2: related question where the memory issue was finally solved
Answer 1 (score: 5)
Setup
df = pd.DataFrame([
[['good', 'movie'], 'positive'],
[['wooow', 'is', 'it', 'very', 'good'], 'positive'],
[['bad', 'movie'], 'negative']
], columns=['Phrase', 'Sentiment'])
df
Phrase Sentiment
0 [good, movie] positive
1 [wooow, is, it, very, good] positive
2 [bad, movie] negative
# use `value_counts` to get counts of items in list
tf = df.Phrase.apply(pd.value_counts).fillna(0)
print(tf)
bad good is it movie very wooow
0 0.0 1.0 0.0 0.0 1.0 0.0 0.0
1 0.0 1.0 1.0 1.0 0.0 1.0 1.0
2 1.0 0.0 0.0 0.0 1.0 0.0 0.0
Compute the inverse document frequency (idf):
# add one to the numerator and the denominator, just in case a term isn't in any document
# the value is zero for a term found in every document and at most log(N + 1)
idf = np.log((len(df) + 1 ) / (tf.gt(0).sum() + 1))
idf
bad 0.693147
good 0.287682
is 0.693147
it 0.693147
movie 0.287682
very 0.693147
wooow 0.693147
dtype: float64
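Written out as a formula, the smoothed idf computed above looks as follows. The symbols $N$ (number of documents, 3 here) and $\mathrm{df}(t)$ (number of documents containing term $t$) are notation introduced for this note, not names from the code. The two worked values correspond to `bad` (found in one phrase) and `good` (found in two).

```latex
\mathrm{idf}(t) = \ln\frac{N + 1}{\mathrm{df}(t) + 1}

\mathrm{idf}(\text{bad}) = \ln\frac{3 + 1}{1 + 1} = \ln 2 \approx 0.693147

\mathrm{idf}(\text{good}) = \ln\frac{3 + 1}{2 + 1} = \ln\frac{4}{3} \approx 0.287682
```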
tfidf
tf * idf
bad good is it movie very wooow
0 0.000000 0.287682 0.000000 0.000000 0.287682 0.000000 0.000000
1 0.000000 0.287682 0.693147 0.693147 0.000000 0.693147 0.693147
2 0.693147 0.000000 0.000000 0.000000 0.287682 0.000000 0.000000
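For convenience, the steps of this answer can be collected into one runnable sketch that also appends the Sentiment column, which is the layout the question asks for. It is an illustration rather than the answer's exact code: it counts tokens with `collections.Counter` instead of `pd.value_counts`, and the names `tfidf` and `result` are chosen for this example.

```python
from collections import Counter

import numpy as np
import pandas as pd

df = pd.DataFrame([
    [['good', 'movie'], 'positive'],
    [['wooow', 'is', 'it', 'very', 'good'], 'positive'],
    [['bad', 'movie'], 'negative'],
], columns=['Phrase', 'Sentiment'])

# Term frequency: raw count of each word per phrase, 0 where the word is absent
tf = pd.DataFrame([Counter(tokens) for tokens in df['Phrase']], index=df.index)
tf = tf.fillna(0).sort_index(axis=1)

# Smoothed inverse document frequency: log((N + 1) / (df(t) + 1))
idf = np.log((len(df) + 1) / (tf.gt(0).sum() + 1))

# The Series idf is aligned on the column labels of tf
tfidf = tf * idf

# Word columns followed by the label column
result = tfidf.join(df['Sentiment'])
print(result)
```

Unlike TfidfVectorizer's defaults, this version does not add 1 to the idf and does not normalize the rows, so the values differ from scikit-learn's output.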