I am doing TensorFlow classification on tweets and their list of classes. The problem is that after splitting the tweets into words and vectorizing them with TF-IDF, there are far more word entries than there are classes.
(The DataFrame "sample", imported from CSV):
Class Tweet
0 1 ضميان قرب شفتك سيد الخود اخاف اموت فراق ما ابت...
1 5 بعد مرور اسبوع عاد صاحب المزرعه ليقول للديك : ...
2 1 انا لو ابتل على الطبخ والموالح ابرك لي من الحل...
3 1 انا اكثر انسان يصلح يقدم محاضرات عن "كيف تيأس ...
4 1 الاغنيه تخلص بس لمن اغنيها انا لا، ابتل اعيد و...
5 1 اللهم أهدني سُقيا من سمائك أبتل بها ولا أزل.
(Code that converts the words to TF-IDF):
import string
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

mess = ''       # placeholder (shadowed by the function argument below)
stopwords = []  # stopword list elided in the original post

def text_cleaning(mess):
    # Strip punctuation characters, then drop stopwords
    delpunc = [c for c in mess if c not in string.punctuation]
    delpunc = ''.join(delpunc)
    return [word for word in delpunc.split() if word.lower() not in stopwords]
# ==== Vectorization TF ====
bagow_transformer = CountVectorizer(analyzer=text_cleaning).fit(tweet['Tweet'][:10])
tweet_bagow = bagow_transformer.transform(tweet['Tweet'][:10])
# ==== Vectorization TF-IDF =====
tfidf_transformer = TfidfTransformer().fit(tweet_bagow)
tweet_tfidf = tfidf_transformer.transform(tweet_bagow)
If I print(tweet_tfidf)
the output is:
(tweet id, word id)    word weight
(0, 141) 0.35476981351536396
(0, 91) 0.3015867532506004
(0, 84) 0.3015867532506004
(0, 82) 0.3015867532506004
(0, 77) 0.35476981351536396
(0, 76) 0.3015867532506004
(0, 69) 0.3015867532506004
(0, 36) 0.3015867532506004
(0, 25) 0.3015867532506004
(0, 11) 0.3015867532506004
(0, 5) 0.14366697931897693
(1, 142) 0.335452510590434
(1, 129) 0.335452510590434
(1, 125) 0.335452510590434
(1, 103) 0.2851652809360297
(1, 42) 0.335452510590434
(1, 41) 0.335452510590434
(1, 18) 0.335452510590434
(1, 14) 0.335452510590434
(1, 6) 0.335452510590434
(1, 5) 0.13584427723416684
(2, 119) 0.2504289625926897
(2, 118) 0.2504289625926897
(2, 117) 0.2504289625926897
(2, 93) 0.2504289625926897
: :
(8, 62) 0.1770906272241602
(8, 55) 0.3541812544483204
(8, 51) 0.3541812544483204
(8, 48) 0.1770906272241602
(8, 43) 0.1770906272241602
(8, 40) 0.1770906272241602
(8, 39) 0.1770906272241602
(8, 37) 0.1770906272241602
(8, 35) 0.1770906272241602
(8, 32) 0.1770906272241602
(8, 24) 0.1770906272241602
(8, 21) 0.1770906272241602
(8, 5) 0.07171431872090847
(9, 123) 0.29928865657458936
(9, 114) 0.29928865657458936
(9, 105) 0.29928865657458936
(9, 100) 0.29928865657458936
(9, 89) 0.29928865657458936
(9, 59) 0.29928865657458936
(9, 49) 0.29928865657458936
(9, 20) 0.29928865657458936
(9, 17) 0.29928865657458936
(9, 15) 0.29928865657458936
(9, 10) 0.29928865657458936
(9, 5) 0.12119942451824135
type(tweet_tfidf)
is:
scipy.sparse.csr.csr_matrix
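A CSR matrix stores only the non-zero cells, which is why print(tweet_tfidf) lists (row, column) value triples instead of a full table. As a minimal sketch (assuming the variables from the code above), comparing shapes makes the mismatch explicit:

# One row per tweet, one column per vocabulary word; the printed
# triples above are just the non-zero cells of this matrix.
print(tweet_tfidf.shape)          # (10, n_vocabulary_words)
print(tweet_tfidf.nnz)            # number of stored (tweet, word) entries
print(tweet['Class'][:10].shape)  # (10,) -- one class per tweet, not per entry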
In TensorFlow you need both training samples and training classes. I have the training samples, but I no longer have classes that line up with them. I would like a DataFrame in which each word weight is associated with the correct class, for example:
(tweet id, word id)    word weight    class
(0, 141) 0.35476981351536396 1
(0, 91) 0.3015867532506004 1
(0, 84) 0.3015867532506004 1
(0, 82) 0.3015867532506004 1
(0, 77) 0.35476981351536396 1
(0, 76) 0.3015867532506004 1
(0, 69) 0.3015867532506004 1
(0, 36) 0.3015867532506004 1
(0, 25) 0.3015867532506004 1
(0, 11) 0.3015867532506004 1
(0, 5) 0.14366697931897693 1
(1, 142) 0.335452510590434 5
(1, 129) 0.335452510590434 5
(1, 125) 0.335452510590434 5
(1, 103) 0.2851652809360297 5
(1, 42) 0.335452510590434 5
(1, 41) 0.335452510590434 5
(1, 18) 0.335452510590434 5
(1, 14) 0.335452510590434 5
(1, 6) 0.335452510590434 5
(1, 5) 0.13584427723416684 5
Answer (score: 1):
This takes a little manipulation. You need:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
import string
import numpy as np
tweet = pd.read_csv('sample.csv', encoding="ISO-8859-1")
mess = ''       # placeholder (shadowed by the function argument below)
stopwords = []  # placeholder stopword list

def text_cleaning(mess):
    # Strip punctuation characters, then drop stopwords
    delpunc = [c for c in mess if c not in string.punctuation]
    delpunc = ''.join(delpunc)
    return [word for word in delpunc.split() if word.lower() not in stopwords]
# ==== Vectorization TF ====
bagow_transformer = CountVectorizer(analyzer=text_cleaning).fit(tweet['Tweet'][:10])
tweet_bagow = bagow_transformer.transform(tweet['Tweet'][:10])
# ==== Vectorization TF-IDF =====
tfidf_transformer = TfidfTransformer().fit(tweet_bagow)
tweet_tfidf = tfidf_transformer.transform(tweet_bagow)
ind_mapping = dict(zip(tweet.index, tweet.Class))
print(ind_mapping)
import scipy.sparse

# One entry per non-zero cell: row indices, column indices, values
I, J, V = scipy.sparse.find(tweet_tfidf)
print(pd.DataFrame([ [i,j,v,ind_mapping[i]] for i,j,v in zip(I,J,V)], columns=['row_index', 'column_index', 'tf_idf', 'class']))
Output:
row_index column_index tf_idf class
0 5 0 0.339570 1
1 5 1 0.339570 1
2 5 2 0.339570 1
3 0 3 0.333333 1
4 2 4 0.283865 1
5 4 4 0.268247 1
6 2 5 0.346171 1
7 0 6 0.333333 1
8 1 7 0.353553 5
9 4 8 0.327125 1
10 4 9 0.327125 1
11 3 10 0.339570 1
12 4 11 0.327125 1
13 2 12 0.346171 1
14 0 13 0.333333 1
15 2 14 0.346171 1
16 5 15 0.339570 1
17 1 16 0.353553 5
18 0 17 0.333333 1
19 3 18 0.278453 1
20 4 18 0.268247 1
21 3 19 0.339570 1
22 4 20 0.327125 1
23 1 21 0.353553 5
24 5 22 0.339570 1
25 4 23 0.327125 1
26 3 24 0.339570 1
27 5 25 0.339570 1
28 0 26 0.333333 1
29 5 27 0.339570 1
30 0 28 0.333333 1
31 1 29 0.353553 5
32 1 30 0.353553 5
33 2 31 0.346171 1
34 3 32 0.339570 1
35 0 33 0.333333 1
36 0 34 0.333333 1
37 3 35 0.339570 1
38 4 36 0.327125 1
39 1 37 0.353553 5
40 4 38 0.327125 1
41 2 39 0.346171 1
42 2 40 0.346171 1
43 1 41 0.353553 5
44 0 42 0.333333 1
45 3 43 0.339570 1
46 1 44 0.353553 5
47 2 45 0.283865 1
48 5 45 0.278453 1
49 4 46 0.327125 1
50 2 47 0.346171 1
51 5 48 0.339570 1
52 3 49 0.339570 1
53 3 50 0.339570 1
Explanation:
Create a mapping from index to class:
ind_mapping = dict(zip(tweet.index, tweet.Class))
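With the six sample rows shown at the top (classes 1, 5, 1, 1, 1, 1), print(ind_mapping) would yield a plain dict keyed by row index:

{0: 1, 1: 5, 2: 1, 3: 1, 4: 1, 5: 1}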
Get the row_index, column_index and tf_idf values:
import scipy.sparse

I, J, V = scipy.sparse.find(tweet_tfidf)
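The three arrays are parallel: I holds the row indices (tweet ids), J the column indices (word ids), and V the stored TF-IDF values, one entry per non-zero cell. A quick sketch to see how they line up with the table above:

# Each zipped (i, j, v) triple becomes one row of the final DataFrame,
# e.g. 5 0 0.33957..., matching the first rows of the output above.
for i, j, v in list(zip(I, J, V))[:3]:
    print(i, j, v)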
Convert the values and the mapping into a DataFrame:
print(pd.DataFrame([ [i,j,v,ind_mapping[i]] for i,j,v in zip(I,J,V)], columns=['row_index', 'column_index', 'tf_idf', 'class']))
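As a side note, the same frame can be built without the Python-level list comprehension or the dict lookup; a vectorized sketch under the same assumptions (the I, J, V arrays and the tweet DataFrame from above, with pandas already imported as pd):

# Index the Class column directly with the row-index array instead of
# looking each row up in ind_mapping.
df = pd.DataFrame({'row_index': I,
                   'column_index': J,
                   'tf_idf': V,
                   'class': tweet['Class'].values[I]})
print(df.head())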