
Time: 2019-03-12 03:43:51

Tags: python scikit-learn nltk sklearn-pandas term-document-matrix


id                    text                                                                          
1104159474368024599 repmiketurner the only time that michael cohen told the truth is when he pled that he is guilty also when he said no collusion and i did not tell him to lie
1104155456019357703 rt msnbc president trump and first lady melania trump view memorial crosses for the 23 people killed in the alabama tornadoes t


import pandas as pd
import nltk.classify
from nltk.tokenize import word_tokenize
from nltk.tokenize import wordpunct_tokenize
from nltk.corpus import stopwords
from nltk.probability import FreqDist
df_tweetText = df_tweet
# Makes a dataframe of just the text and ID to make it easier to tokenize
# (regex=True is needed because pandas 2.0 and later treat the pattern literally by default)
df_tweetText = pd.DataFrame(df_tweetText['text'].str.replace(r'[^\w\s]+', '', regex=True).str.lower())

# Removing stop words
stop = stopwords.words('english')
#df_tweetText['text'] = df_tweetText.apply(lambda x: [item for item in x if item not in stop])
# Remove the https links
df_tweetText['text'] = df_tweetText['text'].replace("[https]+[a-zA-Z0-9]{14}", '', regex=True)
# Tokenize the words

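At first I tried the function word_dist = nltk.FreqDist(df_tweetText['text']), but it ended up counting each whole sentence as one value instead of counting every word in the row.

The other thing I tried was tokenizing each word with df_tweetText['text'] = df_tweetText['text'].apply(word_tokenize) and then calling FreqDist again, but that gave me an error saying unhashable type: 'list'.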
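Both symptoms come from what is being counted. A minimal sketch, assuming collections.Counter as a stand-in for FreqDist (in NLTK 3, FreqDist is a Counter subclass) and str.split() as a stand-in for word_tokenize; the two texts are shortened from the sample tweets:

```python
from collections import Counter
from itertools import chain

# Shortened versions of the two sample tweets (illustration only).
texts = [
    "he said no collusion",
    "president trump and first lady melania trump",
]

# Counting the raw strings treats each whole tweet as a single item.
sentence_counts = Counter(texts)

# After tokenizing, every element is a list. Lists are not hashable,
# so they cannot be used as keys in a counter.
tokenized = [text.split() for text in texts]
error_message = None
try:
    Counter(tokenized)
except TypeError as err:
    error_message = str(err)

# Flattening the lists first gives word counts over the whole corpus.
word_counts = Counter(chain.from_iterable(tokenized))
print(error_message)
print(word_counts.most_common(3))
```

This produces corpus-wide counts only; a per-tweet breakdown needs one counter per row.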

1104159474368024599 [repmiketurner, the, only, time, that, michael, cohen, told, the, truth, is, when, he, pled, that, he, is, guilty, also, when, he, said, no, collusion, and, i, did, not, tell, him, to, lie]
1104155456019357703 [rt, msnbc, president, trump, and, first, lady, melania, trump, view, memorial, crosses, for, the, 23, people, killed, in, the, alabama, tornadoes, t]


id                  |collusion | president |
1104159474368024599 |  1       |     0     |
1104155456019357703 |  0       |     2     |
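The layout above, one row per tweet id and one column per term, can be sketched with the standard library alone. This is an illustration under assumptions, not the asker's code: the token lists are shortened from the tokenized tweets, and the names tokenized, vocabulary and matrix are hypothetical.

```python
from collections import Counter

# Shortened token lists for the two sample tweet ids (illustration only).
tokenized = {
    1104159474368024599: ["he", "said", "no", "collusion"],
    1104155456019357703: ["president", "trump", "and", "first",
                          "lady", "melania", "trump"],
}

# Count terms within each tweet separately.
counts = {tweet_id: Counter(tokens) for tweet_id, tokens in tokenized.items()}

# The matrix columns are every distinct term, sorted for a stable order.
vocabulary = sorted(set().union(*tokenized.values()))

# One row per tweet; a Counter returns 0 for terms the tweet lacks.
matrix = {
    tweet_id: [term_counts[term] for term in vocabulary]
    for tweet_id, term_counts in counts.items()
}

print("id".ljust(20), *vocabulary)
for tweet_id, row in matrix.items():
    print(str(tweet_id).ljust(20), *row)
```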

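Edit 1: So I decided to take a look at the textmining library and recreated one of its examples. The only problem is that it creates the Term Document Matrix for every row of every tweet.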

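import textmining
# Creates the term-document matrix
tweetDocumentmatrix = textmining.TermDocumentMatrix()
for column in df_tweetText:
    # Loop body missing in the source; add_doc follows the textmining package example
    tweetDocumentmatrix.add_doc(df_tweetText['text'].to_string(index=False))
#    print(df_tweetText['text'].to_string(index=False))

for row in tweetDocumentmatrix.rows(cutoff=1):
    print(row)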


from sklearn.feature_extraction.text import CountVectorizer

corpus = df_tweetText['text'].tolist()
vec = CountVectorizer()
X = vec.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
df = pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())

      00  007cigarjoe  08  10  100  1000  10000  100000  1000000  10000000  \
0      0            0   0   0    0     0      0       0        0         0   
1      0            0   0   0    0     0      0       0        0         0   
2      0            0   0   0    0     0      0       0        0         0  

1 answer:

Answer 0 (score: 1)


import pandas as pd
from collections import Counter

# example df
df = pd.DataFrame()
df['tweets'] = [['test','xd'],['hehe','xd'],['sam','xd','xd']]

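# result dataframe: one row per tweet, one column per term
# (DataFrame.append was removed in pandas 2.0, so build the rows and concatenate once)
rows = [pd.DataFrame.from_dict(Counter(row.tweets), orient='index').transpose()
        for i, row in df.iterrows()]
df2 = pd.concat(rows)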
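A more compact variant of the same idea, offered as a sketch, not as the answerer's method: pandas accepts a list of mappings directly, so one Counter per tweet can be passed straight to the DataFrame constructor. Terms missing from a tweet come out as NaN, which this sketch assumes should become integer zeros to match the layout asked for in the question. The name tdm is hypothetical.

```python
import pandas as pd
from collections import Counter

# Same example data as the answer: one token list per tweet.
df = pd.DataFrame()
df['tweets'] = [['test', 'xd'], ['hehe', 'xd'], ['sam', 'xd', 'xd']]

# One Counter per tweet becomes one row; columns are the union of all terms.
tdm = pd.DataFrame([Counter(tokens) for tokens in df['tweets']], index=df.index)

# Terms absent from a tweet are NaN after construction; turn them into 0 counts.
tdm = tdm.fillna(0).astype(int)
print(tdm)
```

If the tweet ids are the index of the original frame, passing index=df.index as above keeps them as row labels.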