How can I find the frequency of each annotation (author, year, lang), as well as the frequency of their unigrams, bigrams, trigrams, ... n-grams? For example, given:
"<author>James Parker</author><year>2008</year><lang>English</lang>"
"<author>Van Wie</author><year>2002</year>"
"<year>2012</year><lang>English</lang>"
"<year>2002</year><lang>French</lang>"
import pandas as pd

file = 'file.csv'
df = pd.read_csv(file)
lines = df['query']
for line in lines:
    # calculate tag frequency
    # calculate frequencies of unigram, bigram, trigram, ..., n-gram tags
    pass
> author: 2, year: 4, lang: 3
trigram: author, year, lang : 1
bigram: author, year: 1
bigram: year, lang: 2
Answer 0 (score: 0)
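If I am reading this correctly, you count only one n-gram per line, so the line
"<author>James Parker</author><year>2008</year><lang>English</lang>"
contains one trigram and three unigrams. You do not need every combination of tags for each line.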
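The simplest way to count is to store the counts in a dictionary keyed by either a tag or a tuple of tags. This needs only a single pass over the data and should scale well with the number of input lines. I use a regular expression to pull out the opening tag of each pair (which means the input must be well-formed), then index into the counter first by each tag name and then by the n-tuple formed from the line's tag names.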
import collections
import re

string = """<author>James Parker</author><year>2008</year><lang>English</lang>
<author>Van Wie</author><year>2002</year>
<year>2012</year><lang>English</lang>
<year>2002</year><lang>French</lang>"""

strings = string.split("\n")
counter = collections.Counter()
# match opening tags only; closing tags start with "</"
tag_re = r"<[^/>]*>"

for s in strings:
    tags = re.findall(tag_re, s)
    # sort so the n-gram key does not depend on tag order within the line
    tags.sort()
    # count each tag by name
    for tag in tags:
        counter[tag] += 1
    # count the tuple of all tags on the line as one n-gram
    ngram = tuple(tags)
    counter[ngram] += 1

print(counter)
This prints:
Counter({'<year>': 4, '<lang>': 3, '<author>': 2, ('<lang>', '<year>'): 2, ('<author>', '<lang>', '<year>'): 1, ('<author>', '<year>'): 1})
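
As a follow-up sketch (not part of the original answer), the same single-pass idea can be wrapped in a function that reports bare tag names and keeps per-tag counts separate from per-line n-gram counts. The function name `count_tag_ngrams` is hypothetical, and keeping tags in document order instead of sorting them is an assumption made for illustration. Like the code above, it assumes well-formed input.

```python
import collections
import re

# Capture the name of each opening tag; closing tags (starting with "</") never match.
TAG_RE = re.compile(r"<([^/>]+)>")


def count_tag_ngrams(lines):
    """Return (tag_counts, ngram_counts) for an iterable of annotated strings.

    Hypothetical helper: each line contributes exactly one n-gram, namely
    the tuple of its tag names in document order.
    """
    tag_counts = collections.Counter()
    ngram_counts = collections.Counter()
    for line in lines:
        tags = TAG_RE.findall(line)
        if not tags:
            continue  # no annotations on this line
        tag_counts.update(tags)          # per-tag frequency
        ngram_counts[tuple(tags)] += 1   # one n-gram per line
    return tag_counts, ngram_counts


lines = [
    "<author>James Parker</author><year>2008</year><lang>English</lang>",
    "<author>Van Wie</author><year>2002</year>",
    "<year>2012</year><lang>English</lang>",
    "<year>2002</year><lang>French</lang>",
]
tag_counts, ngram_counts = count_tag_ngrams(lines)
print(dict(tag_counts))
for ngram, n in ngram_counts.items():
    print(len(ngram), ", ".join(ngram), n)
```

For the four sample lines this counts author 2, year 4 and lang 3, with the n-grams (author, year, lang) once, (author, year) once and (year, lang) twice. Since the function accepts any iterable of strings, a pandas column such as the question's `df['query']` could probably be passed in directly, assuming it holds only strings.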