Question

My pandas data frame looks like this.

thi        0.969378
text       0.969378
is         0.969378
anoth      0.699030
your       0.497120
first      0.497120
book       0.497120
third      0.445149
the        0.445149
for        0.445149
analysi    0.445149

I want to convert it into a list of tuples, as shown below.

[["this", 0.969378], ["text", 0.969378], ..., ["analysi", 0.445149]]

My code is as follows.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk import word_tokenize
from nltk.stem.porter import PorterStemmer

def tokenize(text):
    # build the stemmer once instead of once per token
    stemmer = PorterStemmer()
    return [stemmer.stem(item) for item in word_tokenize(text)]

# your corpus
text = ["This is your first text book", "This is the third text for analysis", "This is another text"]
# word tokenize and stem
text = [" ".join(tokenize(txt.lower())) for txt in text]
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(text).todense()
# transform the matrix to a pandas df
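# note: recent scikit-learn releases use get_feature_names_out(); older ones used get_feature_names()
matrix = pd.DataFrame(matrix, columns=vectorizer.get_feature_names_out())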
# sum over each document (axis=0)
top_words = matrix.sum(axis=0).sort_values(ascending=False)
print(top_words)

I tried the following two options.

list(zip(*map(top_words.get, top_words)))

This raises TypeError: cannot do label indexing on <class 'pandas.core.indexes.base.Index'> with these indexers [0.9693779251346359] of <class 'float'>

list(top_words.itertuples(index=True))

This raises AttributeError: 'Series' object has no attribute 'itertuples'.

Please let me know a quick way to do this in pandas.

I am happy to provide more details if needed.

Answer 1

Use zip on the index and the values, then map each resulting tuple to a list:

a = list(map(list,zip(top_words.index,top_words)))

Or convert the index to a column, convert the result to a NumPy array, and then to a list:

a = top_words.reset_index().to_numpy().tolist()

print(a)
[['thi', 0.9693780000000001], ['text', 0.9693780000000001], 
 ['is', 0.9693780000000001], ['anoth', 0.69903], 
 ['your', 0.49712], ['first', 0.49712], ['book', 0.49712],
 ['third', 0.44514899999999996], ['the', 0.44514899999999996],
 ['for', 0.44514899999999996], ['analysi', 0.44514899999999996]]
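
If actual tuples are wanted rather than nested lists, a minimal sketch along the same lines follows. It assumes top_words is a Series like the one in the question; the three-row Series here is a hypothetical stand-in built from a few of the question's values, and the column name "score" is made up for illustration. It also shows that itertuples does work once the Series is turned into a DataFrame.

```python
import pandas as pd

# Hypothetical stand-in for top_words, using a few values from the question
top_words = pd.Series({"thi": 0.969378, "text": 0.969378, "anoth": 0.699030})

# Series.items() yields (index, value) pairs, so this gives real tuples
as_tuples = list(top_words.items())

# The same pairs as lists, matching the nested-list shape shown in the question
as_lists = [[word, score] for word, score in top_words.items()]

# itertuples is a DataFrame method, so convert the Series to a frame first;
# name=None returns plain tuples instead of namedtuples
via_frame = list(top_words.to_frame("score").itertuples(index=True, name=None))

print(as_tuples)
print(as_lists)
print(via_frame)
```

Which form to pick is mostly a matter of what the downstream code expects: items() is the shortest route to tuples, while the list comprehension mirrors the output printed above.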

How to quickly convert a Pandas data frame to a list of tuples