Question

I have the following DataFrame df, which I converted from an SFrame:

   URI                                            name           text
0  <http://dbpedia.org/resource/Digby_M...        Digby Morrell  digby morrell born 10 october 1979 i...
1  <http://dbpedia.org/resource/Alfred_...       Alfred J. Lewy  alfred j lewy aka sandy lewy graduat...
2  <http://dbpedia.org/resource/Harpdog...        Harpdog Brown  harpdog brown is a singer and harmon...
3  <http://dbpedia.org/resource/Franz_R...  Franz Rottensteiner  franz rottensteiner born in waidmann...
4  <http://dbpedia.org/resource/G-Enka>                  G-Enka  henry krvits born 30 december 1974 i...

This is what I did:

from textblob import TextBlob as tb

import math

def tf(word, blob):
    return blob.words.count(word) / len(blob.words)

def n_containing(word, bloblist):
    return sum(1 for blob in bloblist if word in blob.words)

def idf(word, bloblist):
    return math.log(len(bloblist) / (1 + n_containing(word, bloblist)))

def tfidf(word, blob, bloblist):
    return tf(word, blob) * idf(word, bloblist)

bloblist = []

for i in range(0, df.shape[0]):
    bloblist.append(tb(df.iloc[i,2]))

for i, blob in enumerate(bloblist):
    print("Top words in document {}".format(i + 1))
    scores = {word: tfidf(word, blob, bloblist) for word in blob.words}
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    for word, score in sorted_words[:3]:
        print("\tWord: {}, TF-IDF: {}".format(word, round(score, 5)))

But this takes a very long time, because there are 59,000 documents.

Is there a better way?

Answer 1

I was puzzled by this problem too, but I found some solutions online that use Spark. You can take a look here:

https://www.linkedin.com/pulse/understanding-tf-idf-first-principle-computation-apache-asimadi
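Independent of Spark, much of the cost in the question's code seems to come from n_containing, which rescans every document for every word of every document. A minimal sketch in plain Python (not Spark, and not taken from the linked article) that counts document frequencies in a single pass is shown below. The function name tfidf_top_words is hypothetical, and str.split() is only a stand-in for TextBlob's tokenizer, so the word lists may differ slightly from blob.words.

```python
import math
from collections import Counter

def tfidf_top_words(docs, top_n=3):
    """Return the top_n (word, score) pairs for each document."""
    # Tokenize each document once. Assumption: whitespace splitting is
    # good enough here; TextBlob tokenizes differently.
    tokenized = [doc.split() for doc in docs]

    # Document frequency: how many documents contain each word,
    # computed in one pass instead of once per word lookup.
    doc_freq = Counter()
    for words in tokenized:
        doc_freq.update(set(words))

    n_docs = len(tokenized)
    results = []
    for words in tokenized:
        counts = Counter(words)
        total = len(words)
        # Same formula as in the question: tf * log(N / (1 + df)).
        scores = {
            word: (count / total) * math.log(n_docs / (1 + doc_freq[word]))
            for word, count in counts.items()
        }
        top = sorted(scores.items(), key=lambda x: x[1], reverse=True)[:top_n]
        results.append(top)
    return results
```

Assuming the text column is named text as in the table in the question, this could be called as tfidf_top_words(df["text"]).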
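On the other hand, I tried the following approach and the results were not bad. You may want to try it:
- I have a word list. This list contains the words and their counts.
- I found that the mean of these counts matters.
- I chose a lower and an upper bound from the mean (e.g. lower bound = mean / 2, upper bound = mean * 5).
- Then I created a new word list using the upper and lower bounds.
With this I got the following results:
Word vector length before normalization: 11880
Mean: 19, lower bound: 9, upper bound: 95
Word vector length after normalization: 1595
The cosine similarity results were also better.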
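
The filtering steps above could be sketched roughly as follows. This is only an illustration under assumptions: the function name filter_by_frequency is hypothetical, the word list is assumed to be a dict mapping each word to its count, and the bounds are treated as inclusive, which the description does not specify.

```python
def filter_by_frequency(word_counts, lower_factor=0.5, upper_factor=5):
    """Keep words whose count lies between mean * lower_factor and mean * upper_factor."""
    if not word_counts:
        return {}
    # Mean count over all distinct words.
    mean = sum(word_counts.values()) / len(word_counts)
    lower = mean * lower_factor   # e.g. mean / 2
    upper = mean * upper_factor   # e.g. mean * 5
    # Assumption: both bounds are inclusive.
    return {w: c for w, c in word_counts.items() if lower <= c <= upper}
```

Very rare words and very frequent words both fall outside the bounds, which is what shrinks the vocabulary before the similarity computation.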

Python: How to compute tf-idf for a large dataset
