I am doing TensorFlow classification on tweets and their list of classes. The problem is that after splitting the tweets into words and vectorizing them with TF-IDF, there are far more word entries than there are classes.
(The DataFrame "sample", imported from CSV):
Class Tweet
0 1 ضميان قرب شفتك سيد الخود اخاف اموت فراق ما ابت...
1 5 بعد مرور اسبوع عاد صاحب المزرعه ليقول للديك : ...
2 1 انا لو ابتل على الطبخ والموالح ابرك لي من الحل...
3 1 انا اكثر انسان يصلح يقدم محاضرات عن "كيف تيأس ...
4 1 الاغنيه تخلص بس لمن اغنيها انا لا، ابتل اعيد و...
5 1 اللهم أهدني سُقيا من سمائك أبتل بها ولا أزل.
(Code that converts the words to TF-IDF):
import string
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

mess = ''       # placeholder (shadowed by the function argument below)
stopwords = []  # stopword list elided in the original post

def text_cleaning(mess):
    # Strip punctuation characters, then drop stopwords
    delpunc = [c for c in mess if c not in string.punctuation]
    delpunc = ''.join(delpunc)
    return [word for word in delpunc.split() if word.lower() not in stopwords]
# ==== Vectorization TF ====
bagow_transformer = CountVectorizer(analyzer=text_cleaning).fit(tweet['Tweet'][:10])
tweet_bagow = bagow_transformer.transform(tweet['Tweet'][:10])
# ==== Vectorization TF-IDF =====
tfidf_transformer = TfidfTransformer().fit(tweet_bagow)
tweet_tfidf = tfidf_transformer.transform(tweet_bagow)
If I print(tweet_tfidf)
the output is:
(tweet id, word id)    word weight
(0, 141) 0.35476981351536396
(0, 91) 0.3015867532506004
(0, 84) 0.3015867532506004
(0, 82) 0.3015867532506004
(0, 77) 0.35476981351536396
(0, 76) 0.3015867532506004
(0, 69) 0.3015867532506004
(0, 36) 0.3015867532506004
(0, 25) 0.3015867532506004
(0, 11) 0.3015867532506004
(0, 5) 0.14366697931897693
(1, 142) 0.335452510590434
(1, 129) 0.335452510590434
(1, 125) 0.335452510590434
(1, 103) 0.2851652809360297
(1, 42) 0.335452510590434
(1, 41) 0.335452510590434
(1, 18) 0.335452510590434
(1, 14) 0.335452510590434
(1, 6) 0.335452510590434
(1, 5) 0.13584427723416684
(2, 119) 0.2504289625926897
(2, 118) 0.2504289625926897
(2, 117) 0.2504289625926897
(2, 93) 0.2504289625926897
: :
(8, 62) 0.1770906272241602
(8, 55) 0.3541812544483204
(8, 51) 0.3541812544483204
(8, 48) 0.1770906272241602
(8, 43) 0.1770906272241602
(8, 40) 0.1770906272241602
(8, 39) 0.1770906272241602
(8, 37) 0.1770906272241602
(8, 35) 0.1770906272241602
(8, 32) 0.1770906272241602
(8, 24) 0.1770906272241602
(8, 21) 0.1770906272241602
(8, 5) 0.07171431872090847
(9, 123) 0.29928865657458936
(9, 114) 0.29928865657458936
(9, 105) 0.29928865657458936
(9, 100) 0.29928865657458936
(9, 89) 0.29928865657458936
(9, 59) 0.29928865657458936
(9, 49) 0.29928865657458936
(9, 20) 0.29928865657458936
(9, 17) 0.29928865657458936
(9, 15) 0.29928865657458936
(9, 10) 0.29928865657458936
(9, 5) 0.12119942451824135
type(tweet_tfidf)
is:
scipy.sparse.csr.csr_matrix
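A CSR matrix stores only the non-zero cells, which is why print(tweet_tfidf) lists (row, column) value triples instead of a full table. As a minimal sketch (assuming the variables from the code above), comparing shapes makes the mismatch explicit:

# One row per tweet, one column per vocabulary word; the printed
# triples above are just the non-zero cells of this matrix.
print(tweet_tfidf.shape)          # (10, n_vocabulary_words)
print(tweet_tfidf.nnz)            # number of stored (tweet, word) entries
print(tweet['Class'][:10].shape)  # (10,) -- one class per tweet, not per entry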
In TensorFlow you need both training samples and training classes. I have the training samples, but I no longer have classes that line up with them. I would like a DataFrame in which each word weight is associated with the correct class, for example:
(tweet id, word id)    word weight    class
(0, 141) 0.35476981351536396 1
(0, 91) 0.3015867532506004 1
(0, 84) 0.3015867532506004 1
(0, 82) 0.3015867532506004 1
(0, 77) 0.35476981351536396 1
(0, 76) 0.3015867532506004 1
(0, 69) 0.3015867532506004 1
(0, 36) 0.3015867532506004 1
(0, 25) 0.3015867532506004 1
(0, 11) 0.3015867532506004 1
(0, 5) 0.14366697931897693 1
(1, 142) 0.335452510590434 5
(1, 129) 0.335452510590434 5
(1, 125) 0.335452510590434 5
(1, 103) 0.2851652809360297 5
(1, 42) 0.335452510590434 5
(1, 41) 0.335452510590434 5
(1, 18) 0.335452510590434 5
(1, 14) 0.335452510590434 5
(1, 6) 0.335452510590434 5
(1, 5) 0.13584427723416684 5
Answer (score: 1):
This takes a little manipulation. You need:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
import string
import numpy as np
tweet = pd.read_csv('sample.csv', encoding="ISO-8859-1")
mess = ''       # placeholder (shadowed by the function argument below)
stopwords = []  # placeholder stopword list

def text_cleaning(mess):
    # Strip punctuation characters, then drop stopwords
    delpunc = [c for c in mess if c not in string.punctuation]
    delpunc = ''.join(delpunc)
    return [word for word in delpunc.split() if word.lower() not in stopwords]
# ==== Vectorization TF ====
bagow_transformer = CountVectorizer(analyzer=text_cleaning).fit(tweet['Tweet'][:10])
tweet_bagow = bagow_transformer.transform(tweet['Tweet'][:10])
# ==== Vectorization TF-IDF =====
tfidf_transformer = TfidfTransformer().fit(tweet_bagow)
tweet_tfidf = tfidf_transformer.transform(tweet_bagow)
ind_mapping = dict(zip(tweet.index, tweet.Class))
print(ind_mapping)
import scipy.sparse

# One entry per non-zero cell: row indices, column indices, values
I, J, V = scipy.sparse.find(tweet_tfidf)
print(pd.DataFrame([ [i,j,v,ind_mapping[i]] for i,j,v in zip(I,J,V)], columns=['row_index', 'column_index', 'tf_idf', 'class']))
Output:
row_index column_index tf_idf class
0 5 0 0.339570 1
1 5 1 0.339570 1
2 5 2 0.339570 1
3 0 3 0.333333 1
4 2 4 0.283865 1
5 4 4 0.268247 1
6 2 5 0.346171 1
7 0 6 0.333333 1
8 1 7 0.353553 5
9 4 8 0.327125 1
10 4 9 0.327125 1
11 3 10 0.339570 1
12 4 11 0.327125 1
13 2 12 0.346171 1
14 0 13 0.333333 1
15 2 14 0.346171 1
16 5 15 0.339570 1
17 1 16 0.353553 5
18 0 17 0.333333 1
19 3 18 0.278453 1
20 4 18 0.268247 1
21 3 19 0.339570 1
22 4 20 0.327125 1
23 1 21 0.353553 5
24 5 22 0.339570 1
25 4 23 0.327125 1
26 3 24 0.339570 1
27 5 25 0.339570 1
28 0 26 0.333333 1
29 5 27 0.339570 1
30 0 28 0.333333 1
31 1 29 0.353553 5
32 1 30 0.353553 5
33 2 31 0.346171 1
34 3 32 0.339570 1
35 0 33 0.333333 1
36 0 34 0.333333 1
37 3 35 0.339570 1
38 4 36 0.327125 1
39 1 37 0.353553 5
40 4 38 0.327125 1
41 2 39 0.346171 1
42 2 40 0.346171 1
43 1 41 0.353553 5
44 0 42 0.333333 1
45 3 43 0.339570 1
46 1 44 0.353553 5
47 2 45 0.283865 1
48 5 45 0.278453 1
49 4 46 0.327125 1
50 2 47 0.346171 1
51 5 48 0.339570 1
52 3 49 0.339570 1
53 3 50 0.339570 1
Explanation:
Create a mapping from index to class:
ind_mapping = dict(zip(tweet.index, tweet.Class))
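With the six sample rows shown at the top (classes 1, 5, 1, 1, 1, 1), print(ind_mapping) would yield a plain dict keyed by row index:

{0: 1, 1: 5, 2: 1, 3: 1, 4: 1, 5: 1}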
Get the row_index, column_index and tf_idf values:
import scipy.sparse

I, J, V = scipy.sparse.find(tweet_tfidf)
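The three arrays are parallel: I holds the row indices (tweet ids), J the column indices (word ids), and V the stored TF-IDF values, one entry per non-zero cell. A quick sketch to see how they line up with the table above:

# Each zipped (i, j, v) triple becomes one row of the final DataFrame,
# e.g. 5 0 0.33957..., matching the first rows of the output above.
for i, j, v in list(zip(I, J, V))[:3]:
    print(i, j, v)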
Convert the values and the mapping into a DataFrame:
print(pd.DataFrame([ [i,j,v,ind_mapping[i]] for i,j,v in zip(I,J,V)], columns=['row_index', 'column_index', 'tf_idf', 'class']))
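As a side note, the same frame can be built without the Python-level list comprehension or the dict lookup; a vectorized sketch under the same assumptions (the I, J, V arrays and the tweet DataFrame from above, with pandas already imported as pd):

# Index the Class column directly with the row-index array instead of
# looking each row up in ind_mapping.
df = pd.DataFrame({'row_index': I,
                   'column_index': J,
                   'tf_idf': V,
                   'class': tweet['Class'].values[I]})
print(df.head())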