Creating a term frequency matrix from a Python DataFrame

Date: 2019-03-12 03:43:51

Tags: python scikit-learn nltk sklearn-pandas term-document-matrix

I'm doing some natural language processing on Twitter data. I've managed to successfully load and clean up some tweets and put them into the dataframe below.

id                    text                                                                          
1104159474368024599 repmiketurner the only time that michael cohen told the truth is when he pled that he is guilty also when he said no collusion and i did not tell him to lie
1104155456019357703 rt msnbc president trump and first lady melania trump view memorial crosses for the 23 people killed in the alabama tornadoes t

The problem is that I'm trying to build a term frequency matrix where each row is a tweet and each column is the count of a given word in that row. My only issue is that other posts only mention term frequency distributions over text files. Here is the code I used to generate the dataframe above:

import pandas as pd
import nltk.classify
from nltk.tokenize import word_tokenize
from nltk.tokenize import wordpunct_tokenize
from nltk.corpus import stopwords
from nltk.probability import FreqDist

df_tweetText = df_tweet
#Makes a dataframe of just the text and ID to make it easier to tokenize
df_tweetText = pd.DataFrame(df_tweetText['text'].str.replace(r'[^\w\s]+', '', regex=True).str.lower())

#Removing stop words
#nltk.download('stopwords')
stop = stopwords.words('english')
#df_tweetText['text'] = df_tweetText.apply(lambda x: [item for item in x if item not in stop])
#Remove the https links
df_tweetText['text'] = df_tweetText['text'].replace("[https]+[a-zA-Z0-9]{14}", '', regex=True, inplace=False)
#Tokenize the words
df_tweetText
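Note that the link pattern above is a character class: [https]+ matches any run of the letters h, t, p, s, so it can eat ordinary words and only handles links that are followed by exactly 14 alphanumerics. A sketch of a more conventional approach, assuming it runs before punctuation is stripped, while the URLs are still intact:

#Strip URLs while they still contain :// and punctuation
df_tweet['text'] = df_tweet['text'].str.replace(r'https?://\S+', '', regex=True)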

At first I tried the function word_dist = nltk.FreqDist(df_tweetText['text']), but it ended up counting whole sentences as single values instead of each word in the row.

Another thing I tried was tokenizing each word with df_tweetText['text'] = df_tweetText['text'].apply(word_tokenize) and then calling FreqDist again, but that gave me an error: unhashable type: 'list'
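For what it's worth, the unhashable type: 'list' error happens because FreqDist hashes each element it receives, and after apply(word_tokenize) every element of the column is a list of tokens. Applying FreqDist row by row avoids this, since within one row the elements are hashable strings. A minimal sketch, assuming the column already holds token lists:

#One FreqDist per tweet, counting only that tweet's tokens
word_dists = df_tweetText['text'].apply(FreqDist)
print(word_dists.iloc[0].most_common(5))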

1104159474368024599 [repmiketurner, the, only, time, that, michael, cohen, told, the, truth, is, when, he, pled, that, he, is, guilty, also, when, he, said, no, collusion, and, i, did, not, tell, him, to, lie]
1104155456019357703 [rt, msnbc, president, trump, and, first, lady, melania, trump, view, memorial, crosses, for, the, 23, people, killed, in, the, alabama, tornadoes, t]

Is there another way to build this term frequency matrix? Ideally I want my data to look like this:

id                  |collusion | president |
------------------------------------------ 
1104159474368024599 |  1       |     0     |
1104155456019357703 |  0       |     2     |
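For reference, one way to get exactly that shape from the token lists is to turn each row into a collections.Counter and let pandas align the vocabulary as columns. This is a sketch; words absent from a row come back as NaN, hence the fillna:

from collections import Counter

tf_matrix = (df_tweetText['text']
             .apply(Counter)     #per-row word counts
             .apply(pd.Series)   #counts become columns, aligned by word
             .fillna(0)
             .astype(int))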

EDIT 1: So I decided to take a look at the textmining library and recreated one of its examples. The only problem is that it builds the Term Document Matrix with the text of every tweet packed into every row.

import textmining
#Creates Term Matrix 
tweetDocumentmatrix = textmining.TermDocumentMatrix()
for column in df_tweetText:
    tweetDocumentmatrix.add_doc(df_tweetText['text'].to_string(index=False))
#    print(df_tweetText['text'].to_string(index=False))

for row in tweetDocumentmatrix.rows(cutoff=1):
    print(row)
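The reason every row comes out identical is that for column in df_tweetText iterates over column names, and each pass adds the entire text column, flattened by to_string, as a single document. Feeding the rows in one at a time gives one document per tweet. A sketch using the same textmining API, assuming the column still holds plain strings (join the tokens first if it has already been tokenized):

tweetDocumentmatrix = textmining.TermDocumentMatrix()
for text in df_tweetText['text']:
    tweetDocumentmatrix.add_doc(text)  #one tweet = one document

for row in tweetDocumentmatrix.rows(cutoff=1):
    print(row)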

EDIT 2: So I tried sklearn, and this approach mostly works, but the problem is that I'm finding Chinese/Japanese characters in my columns that shouldn't be there. Also, for some reason, my columns are showing up as numbers.

from sklearn.feature_extraction.text import CountVectorizer

#Learn the vocabulary across all tweets and count each token per tweet
corpus = df_tweetText['text'].tolist()
vec = CountVectorizer()
X = vec.fit_transform(corpus)
df = pd.DataFrame(X.toarray(), columns=vec.get_feature_names())
print(df)

      00  007cigarjoe  08  10  100  1000  10000  100000  1000000  10000000  \
0      0            0   0   0    0     0      0       0        0         0   
1      0            0   0   0    0     0      0       0        0         0   
2      0            0   0   0    0     0      0       0        0         0  
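The numeric column names are expected: CountVectorizer's default token_pattern, (?u)\b\w\w+\b, accepts any run of two or more Unicode word characters, which includes digits and the CJK characters carried over in the scraped text. Narrowing the pattern to ASCII letters is one way to drop both; a sketch:

vec = CountVectorizer(token_pattern=r'(?u)\b[a-zA-Z]{2,}\b')  #letters only, length >= 2
X = vec.fit_transform(corpus)
df = pd.DataFrame(X.toarray(), columns=vec.get_feature_names())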

1 Answer:

Answer 0 (score: 1)

Iterating over every row is probably not optimal, but it works. Mileage may vary depending on how long the tweets are and how many tweets are being processed.

import pandas as pd
from collections import Counter

# example df
df = pd.DataFrame()
df['tweets'] = [['test','xd'],['hehe','xd'],['sam','xd','xd']]

# result dataframe
df2 = pd.DataFrame()
for i, row in df.iterrows():
    df2 = df2.append(pd.DataFrame.from_dict(Counter(row.tweets), orient='index').transpose())
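Two cleanup notes on the result: words missing from a given tweet come back as NaN rather than 0, and the appended rows all carry index 0. A short sketch that fixes both (the index assignment assumes one output row per input row):

#Replace NaN with 0 counts and restore the original index
df2 = df2.fillna(0).astype(int)
df2.index = df.index
print(df2)

On pandas 2.0 and later, DataFrame.append has been removed, so the per-row frames would instead need to be collected in a list and combined with a single pd.concat call.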