My pandas DataFrame looks like this.
thi 0.969378
text 0.969378
is 0.969378
anoth 0.699030
your 0.497120
first 0.497120
book 0.497120
third 0.445149
the 0.445149
for 0.445149
analysi 0.445149
I want to convert it to a list of tuples, as shown below.
[["this", 0.969378], ["text", 0.969378], ..., ["analysi", 0.445149]]
My code is as follows.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk import word_tokenize
from nltk.stem.porter import PorterStemmer
def tokenize(text):
    tokens = word_tokenize(text)
    stems = []
    for item in tokens:
        stems.append(PorterStemmer().stem(item))
    return stems
# your corpus
text = ["This is your first text book", "This is the third text for analysis", "This is another text"]
# word tokenize and stem
text = [" ".join(tokenize(txt.lower())) for txt in text]
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(text).todense()
# transform the matrix to a pandas df
matrix = pd.DataFrame(matrix, columns=vectorizer.get_feature_names())
# sum over each document (axis=0)
top_words = matrix.sum(axis=0).sort_values(ascending=False)
print(top_words)
I tried the following two options.
list(zip(*map(top_words.get, top_words)))
This gives me the error TypeError: cannot do label indexing on <class 'pandas.core.indexes.base.Index'> with these indexers [0.9693779251346359] of <class 'float'>
list(top_words.itertuples(index=True))
This gives me the error AttributeError: 'Series' object has no attribute 'itertuples'.
Please let me know a quick way of doing this in pandas.
I am happy to provide more details if needed.
Answer 0 (score: 1)
Use zip on the index and the values, mapping each resulting tuple to a list:
a = list(map(list,zip(top_words.index,top_words)))
Or convert the index to a column, convert the result to a NumPy array, and then to a list:
a = top_words.reset_index().to_numpy().tolist()
print (a)
[['thi', 0.9693780000000001], ['text', 0.9693780000000001],
['is', 0.9693780000000001], ['anoth', 0.69903],
['your', 0.49712], ['first', 0.49712], ['book', 0.49712],
['third', 0.44514899999999996], ['the', 0.44514899999999996],
['for', 0.44514899999999996], ['analysi', 0.44514899999999996]]
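For anyone who wants to try this without running the whole TF-IDF pipeline, here is a minimal sketch. It assumes top_words is a Series with the words as its index; the Series is rebuilt by hand from the first four values printed in the question, and the names pairs_as_lists, pairs_as_tuples and rows are only illustrative. It also shows two ways to get actual tuples, which may be closer to what the second attempt in the question was aiming for: Series.items() yields (label, value) pairs, and itertuples is available once the Series has been turned into a DataFrame with reset_index().

```python
import pandas as pd

# Hand-built stand-in for top_words (assumption: first four rows from the question).
top_words = pd.Series(
    [0.969378, 0.969378, 0.969378, 0.699030],
    index=["thi", "text", "is", "anoth"],
)

# Pair each index label with its value, as lists (first approach in this answer).
pairs_as_lists = list(map(list, zip(top_words.index, top_words)))

# Same pairs as tuples: Series.items() yields (label, value) pairs.
pairs_as_tuples = list(top_words.items())

# A Series has no itertuples, but the DataFrame from reset_index() does.
# name=None asks for plain tuples instead of namedtuples.
rows = list(top_words.reset_index().itertuples(index=False, name=None))

print(pairs_as_lists)
print(pairs_as_tuples)
print(rows)
```

The first attempt in the question most likely fails because map(top_words.get, top_words) iterates over the values of the Series and then passes each float to get, which expects an index label.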