How to find the term frequency of a particular set of tags in a document

Time: 2014-03-11 16:05:44

Tags: python annotations tf-idf

How do I find the frequency of each annotation (author, year, lang) and the frequency of their unigram, bigram, trigram, ... n-gram combinations? For example:

"<author>James Parker</author><year>2008</year><lang>English</lang>"
"<author>Van Wie</author><year>2002</year>"
"<year>2012</year><lang>English</lang>"
"<year>2002</year><lang>French</lang>"


import pandas as pd

file = 'file.csv'
df = pd.read_csv(file)
lines = df['query']
for line in lines:
    # calculate tag frequency
    # calculate frequencies of unigram, bigram, trigram, ... n-gram tag combinations
    pass

author: 2, year: 4, lang: 3

trigram (author, year, lang): 1
bigram (author, year): 1
bigram (year, lang): 2

1 Answer:

Answer 0 (score: 0)

If I'm reading this correctly, you count only one n-gram per line, so the line

"<author>James Parker</author><year>2008</year><lang>English</lang>" 

counts as one trigram and three unigrams; you don't want every combination within a line.

The simplest way to count these is a dictionary keyed by tag name or by a tuple of tag names. That gives you a single pass over the data and should scale well with the number of input lines. I use a regular expression to pull out the opening tag of each pair (which means the input has to be well formed), then index into the counter first by each tag name and then by the tuple built from all the tag names on the line.

import collections
import re

string = """<author>James Parker</author><year>2008</year><lang>English</lang>
<author>Van Wie</author><year>2002</year>
<year>2012</year><lang>English</lang>
<year>2002</year><lang>French</lang>"""

strings = string.split("\n")
counter = collections.Counter()

# match only opening tags such as <author>, <year>, <lang>
tag_re = r"<[^/>]*>"
for s in strings:
    tags = re.findall(tag_re, s)
    tags.sort()  # sort so the n-gram key does not depend on tag order
    # count each tag name individually (unigrams)
    for tag in tags:
        counter[tag] += 1
    # count the whole set of tags on the line as a single n-gram
    ngram = tuple(tags)
    counter[ngram] += 1

print(counter)

This prints:

Counter({'<year>': 4, '<lang>': 3, '<author>': 2, ('<lang>', '<year>'): 2, ('<author>', '<lang>', '<year>'): 1, ('<author>', '<year>'): 1})
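
The question reads the annotated strings from a CSV with pandas rather than from a hard-coded string. A minimal sketch of wiring the same counting into that loop, assuming the 'file.csv' path and 'query' column from the question's own snippet, might look like this:

import collections
import re

import pandas as pd

tag_re = r"<[^/>]*>"  # opening tags only, e.g. <author>
counter = collections.Counter()

# 'file.csv' and the 'query' column are taken from the question's snippet
df = pd.read_csv('file.csv')
for line in df['query'].dropna():
    tags = sorted(re.findall(tag_re, line))
    counter.update(tags)       # per-tag (unigram) counts
    counter[tuple(tags)] += 1  # one n-gram per line

print(counter)

The Counter ends up holding both kinds of keys at once: plain strings for individual tags and tuples for the per-line n-grams, exactly as in the answer above.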