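I have a DataFrame df with three columns, as shown below:
DocumentID  Words          Region
1           ['A','B','C']  ['Canada']
2           ['A','X','D']  ['India', 'USA', 'Canada']
3           ['B','C','X']  ['Canada']
I want to compute the IDF of every word in the "Words" column, i.e. I want to produce an output that lists each word ('A', 'B', 'C', and so on) together with its IDF value.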
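Answer 0 (score: 0)
Here is a slightly more generic version. Assuming you want IDF defined as 1/df, where df is the number of documents that contain the word, you can iterate over each "document" in the Words column and count: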
from collections import defaultdict

# Assuming the Words column is represented as you presented it:
words = [['A','B','C'],
         ['A','X','D'],
         ['B','C','X']]

# Document frequency: the number of documents each word appears in.
doc_freq = defaultdict(int)
for doc in words:
    for w in set(doc):  # set() counts a repeated word once per document
        doc_freq[w] += 1

# Compute IDF as 1/df:
idf = {k: 1.0 / v for k, v in doc_freq.items()}  # {'A': 0.5, 'B': 0.5, 'C': 0.5, 'D': 1.0, 'X': 0.5}
vocab = idf.keys()  # Note that the vocab is also accessible now.
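The 1/df definition used above is only one convention. Many references define IDF as log(N/df) instead, where N is the total number of documents. Below is a minimal sketch of that variant, assuming the same list-of-lists input; the function name `log_idf` is made up for illustration and does not come from any library.

```python
import math
from collections import Counter

def log_idf(docs):
    """Return {word: log(N / df)}, where df is the number of documents containing the word."""
    n_docs = len(docs)
    # set(doc) makes sure a word repeated inside one document is counted once.
    doc_freq = Counter(w for doc in docs for w in set(doc))
    return {w: math.log(n_docs / df) for w, df in doc_freq.items()}

words = [['A', 'B', 'C'],
         ['A', 'X', 'D'],
         ['B', 'C', 'X']]
idf = log_idf(words)
```

With this definition a word that occurs in every document gets an IDF of 0, and rarer words get larger values.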
Answer 1 (score: -1)
import pandas as pd

# Assumes df is the DataFrame from the question.
# Flatten each list-valued column into one flat list of tokens.
list_words = []
list_regions = []
for words in df['Words']:
    for word in words:
        list_words.append(word)
for regions in df['Region']:
    for region in regions:
        list_regions.append(region)

# One row per distinct token.
IDF_words = pd.DataFrame([], columns=['words','IDF'])
IDF_regions = pd.DataFrame([], columns=['regions','IDF'])
IDF_words['words'] = sorted(set(list_words))
IDF_regions['regions'] = sorted(set(list_regions))

# Note: count / total number of tokens is a relative frequency, not an inverse document frequency.
IDF_words['IDF'] = IDF_words['words'].map(lambda x: list_words.count(x)/float(len(list_words)))
IDF_regions['IDF'] = IDF_regions['regions'].map(lambda x: list_regions.count(x)/float(len(list_regions)))
Hope this helps!
If it does, please upvote / mark it as the answer :)
Peace
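The 'IDF' column in the answer above holds each token's count divided by the total number of tokens, which is a relative frequency. For comparison, here is a rough sketch that works on the DataFrame directly and computes log(N/df) per token. It assumes a pandas version that provides `Series.explode` (0.25 or later) and rebuilds the question's data inline; `idf_table` is a hypothetical helper name, not a pandas function.

```python
import numpy as np
import pandas as pd

# The example data from the question.
df = pd.DataFrame({
    'DocumentID': [1, 2, 3],
    'Words': [['A', 'B', 'C'], ['A', 'X', 'D'], ['B', 'C', 'X']],
    'Region': [['Canada'], ['India', 'USA', 'Canada'], ['Canada']],
})

def idf_table(frame, column):
    """Hypothetical helper: one row per distinct token with its log(N / df)."""
    n_docs = len(frame)
    # Deduplicate inside each row, then put every token on its own row.
    tokens = frame[column].map(lambda items: sorted(set(items))).explode()
    # Number of rows (documents) each token appears in, sorted by token.
    doc_freq = tokens.value_counts().sort_index()
    return pd.DataFrame({column: doc_freq.index,
                         'IDF': np.log(n_docs / doc_freq.to_numpy())})

idf_words = idf_table(df, 'Words')
idf_regions = idf_table(df, 'Region')
```

Here 'Canada' appears in all three rows, so its IDF under this definition is 0, while 'India' and 'USA' each appear in one row and get log(3).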