I have a pandas DataFrame, and I want to show 2-gram frequencies based on a text column.
text_column
This is a book
This is a book that is read
This is a book but he doesn't think this is a book
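The end result should be a frequency count of 2-grams, but the frequency should count whether a 2-gram is present in each document, not how many times the 2-gram occurs in total.
So part of the result would be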
2 gram Count
This is 3
a book 3
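Both "This is" and "a book" appear in all 3 texts, although the 3rd text contains each of them twice. Since I only want to know how many documents these 2-grams appear in, the count is 3, not 4.
Any idea how I can do this?
Thanks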
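Answer 0: (score: 2)
Pythonic answer (written generically, so it can be applied to a file, a DataFrame, or anything else you can iterate over line by line):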
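import collections

c = collections.Counter()
for i in fh:  # fh: any iterable of text lines, e.g. an open file or df["text_column"]
    x = i.rstrip().split(" ")
    # set() ensures each 2-gram is counted at most once per line
    c.update(set(zip(x[:-1], x[1:])))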
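Now c holds the frequency of each 2-gram, counted once per line (document).
Explanation:
Each line is split on spaces with split.
zip() returns an iterator of tuples of length 2 (the 2-grams).
That iterator is passed to set() to remove duplicates.
The set is passed to the collections.Counter() object, which keeps track of how many times each tuple has occurred. You need to import collections to use it.
And yes, Python is awesome.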
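As a minimal sketch of the steps above, here is the same idea run on the three sample texts from the question. The helper name doc_bigram_counts and the list texts are made up for illustration; with a DataFrame you would presumably pass df["text_column"] instead of the list.

```python
import collections


def doc_bigram_counts(lines):
    """Count, for each 2-gram, the number of lines (documents) that contain it."""
    counts = collections.Counter()
    for line in lines:
        words = line.rstrip().split(" ")
        # set() collapses repeats within one line, so each line adds at most 1
        counts.update(set(zip(words[:-1], words[1:])))
    return counts


# Sample texts copied from the question (hypothetical stand-in for the real column)
texts = [
    "This is a book",
    "This is a book that is read",
    "This is a book but he doesn't think this is a book",
]

c = doc_bigram_counts(texts)
print(c[("a", "book")])   # 3: present in all three texts, repeated only in the third
print(c[("this", "is")])  # 1: lowercase "this" is a different token from "This"
```

Note that the split is case-sensitive, so "This is" and "this is" are counted as separate 2-grams. If that is not what you want, lowercasing each line before splitting is one option.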
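Answer 1: (score: 0)
This is very C-style, but it works. The idea is to keep track of the "current" bigrams for each document (cur_bigrams = set()), making sure each bigram is added only once per document, and then, after each document, to increment a global frequency counter (bigram_freq) for every bigram that occurs in the current document. A new DataFrame is then built from the information in bigram_freq (the global counter across documents).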
import pandas as pd

bigram_freq = {}  # bigram -> number of documents containing it
for doc in df["text_column"]:
    cur_bigrams = set()
    words = doc.split(" ")
    bigrams = zip(words, words[1:])
    for bigram in bigrams:
        if bigram not in cur_bigrams:  # Add bigram, but only once/doc
            cur_bigrams.add(bigram)
    # After each document, bump the global counter once per distinct bigram
    for bigram in cur_bigrams:
        if bigram in bigram_freq:
            bigram_freq[bigram] += 1
        else:
            bigram_freq[bigram] = 1

result_df = pd.DataFrame(columns=["2_gram", "count"])
row_list = []
for bigram, freq in bigram_freq.items():
    row_list.append([bigram[0] + " " + bigram[1], freq])
for i in range(len(row_list)):
    result_df.loc[i] = row_list[i]
print(result_df)
Output (the row order may differ between runs, because sets are unordered):
2_gram count
0 a book 3
1 is a 3
2 This is 3
3 is read 1
4 that is 1
5 book that 1
6 he doesn't 1
7 this is 1
8 book but 1
9 but he 1
10 think this 1
11 doesn't think 1
You could use a more functional style and/or list comprehensions to condense the code a bit. I leave that as an exercise for the reader.
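One possible condensed form, offered only as a sketch: it swaps the hand-written dictionary for collections.Counter and a generator expression. The function name bigram_rows and the list docs are hypothetical; docs stands in for df["text_column"].

```python
from collections import Counter


def bigram_rows(docs):
    """Return (2_gram, count) rows; count is the number of docs containing the 2-gram."""
    freq = Counter(
        " ".join(pair)
        for words in (doc.split(" ") for doc in docs)
        # set() keeps each bigram once per document
        for pair in set(zip(words, words[1:]))
    )
    # most_common() returns the rows sorted by count, highest first
    return freq.most_common()


# Sample texts copied from the question (stand-in for df["text_column"])
docs = [
    "This is a book",
    "This is a book that is read",
    "This is a book but he doesn't think this is a book",
]

rows = bigram_rows(docs)
for two_gram, count in rows:
    print(two_gram, count)
```

With the DataFrame from the question, something like pd.DataFrame(bigram_rows(df["text_column"]), columns=["2_gram", "count"]) should give the same table in a single call, without the row-by-row loc assignments.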