Python - Ignoring numbers and symbols in bigram frequencies

Time: 2017-02-24 15:53:57

Tags: python nltk

I am trying to find bigram frequencies in the text of a txt file. It works so far, but it also counts numbers and symbols. Here is my code:

import nltk
import prettytable

# Read the whole file and split it into tokens
with open('tweets.txt') as f:
    text = f.read()
tokens = nltk.word_tokenize(text)

pt = prettytable.PrettyTable(['Words', 'Counts'])
pt.align['Words'] = 'l'
pt.align['Counts'] = 'r'

# Count every pair of adjacent tokens
bgs = nltk.bigrams(tokens)
fdist = nltk.FreqDist(bgs)

for row in fdist.most_common(100):
    pt.add_row(row)
print(pt)


Below is the code output:
+------------------------------------+--------+
| Words                              | Counts |
+------------------------------------+--------+
| ('https', ':')                     |   1615 |
| ('!', '#')                         |    445 |
| ('Thank', 'you')                   |    386 |
| ('.', '``')                        |    358 |
| ('.', 'I')                         |    354 |
| ('.', 'Thank')                     |    337 |
| ('``', '@')                        |    320 |
| ('&', 'amp')                       |    290 |

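Is there a way to ignore numbers and symbols (like !, ., ?, :)? Since the text consists of tweets, I want to ignore numbers and symbols, except for #'s and @'s.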

1 answer:

Answer 0: (score: 0)

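`fdist.most_common()` on the bigram fdist returns a sequence of `(bigram, count)` tuples, where each bigram is itself a tuple of two tokens. So we need to unpack each entry, inspect the bigram tuple, and keep only the bigrams we want together with their counts. Try this: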

import nltk
from nltk.probability import FreqDist
from nltk.util import ngrams
from pprint import pprint 

def filter_most_common_bigrams(mc_bigrams_counts):
    """Keep only the (bigram, count) pairs whose bigram passes the filter."""
    filtered_mc_bigrams_counts = []
    for mc_bigram_count in mc_bigrams_counts:
        bigram, count = mc_bigram_count
        # Keep bigrams made only of letters, or '#'/'@' followed by letters
        if all([gram.isalpha() for gram in bigram]) or bigram[0] in "#@" and bigram[1].isalpha():
            filtered_mc_bigrams_counts.append((bigram, count))
    return tuple(filtered_mc_bigrams_counts)

text = """Is there a way to ignore numbers and symbols ( like !,.,?,:)?
Since the text are tweets, I want to ignore numbers and symbols, except for the #'s and @'s
https: !# . Thank you . `` 12 hi . 1st place 1 love 13 in @twitter # twitter"""

tokenized_text = nltk.word_tokenize(text)
bigrams = ngrams(tokenized_text, 2)
fdist = FreqDist(bigrams)
mc_bigrams_counts = fdist.most_common(100)
pprint(filter_most_common_bigrams(mc_bigrams_counts))

The key code is:

if all([gram.isalpha() for gram in bigram]) or bigram[0] in "#@" and bigram[1].isalpha():
    filtered_mc_bigrams_counts.append((bigram, count))

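This checks whether every unigram in the bigram is alphabetic or, alternatively, whether the first unigram is a # or @ symbol and the second unigram consists of letters. Because `and` binds more tightly than `or` in Python, the condition is evaluated as `A or (B and C)`. Only the bigrams that meet these conditions are appended, each as a tuple that also carries the bigram's count from fdist.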
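To make the condition easier to check in isolation, here is a minimal sketch that pulls it out into a standalone predicate. The name `keep_bigram` is hypothetical and not part of the answer. It uses explicit parentheses-free sub-expressions and a tuple membership test instead of relying on operator precedence and substring matching; for the tokens seen in this question it should behave the same way.

```python
def keep_bigram(bigram):
    """Return True if the bigram should be kept (hypothetical helper)."""
    first, second = bigram
    # Case 1: both tokens consist only of letters.
    both_alpha = first.isalpha() and second.isalpha()
    # Case 2: a hashtag or mention marker followed by a letters-only token.
    tagged = first in ("#", "@") and second.isalpha()
    return both_alpha or tagged


# A few bigrams taken from the question's output and the answer's sample text
samples = [
    ("Thank", "you"),
    ("#", "twitter"),
    ("https", ":"),
    ("1st", "place"),
    ("!", "#"),
]
kept = [bg for bg in samples if keep_bigram(bg)]
print(kept)
```

Note that `str.isalpha()` rejects any token containing a digit or punctuation, so mixed tokens such as `1st` are dropped along with pure numbers.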

Result:

((('to', 'ignore'), 2),
 (('and', 'symbols'), 2),
 (('ignore', 'numbers'), 2),
 (('numbers', 'and'), 2),
 (('for', 'the'), 1),
 (('@', 'twitter'), 1),
 (('Is', 'there'), 1),
 (('text', 'are'), 1),
 (('a', 'way'), 1),
 (('Thank', 'you'), 1),
 (('want', 'to'), 1),
 (('Since', 'the'), 1),
 (('I', 'want'), 1),
 (('#', 'twitter'), 1),
 (('the', 'text'), 1),
 (('are', 'tweets'), 1),
 (('way', 'to'), 1),
 (('except', 'for'), 1),
 (('there', 'a'), 1))
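One thing to keep in mind: because the filter runs after `most_common(100)`, the final list can contain fewer than 100 entries. A possible variation, assuming you want the top N after filtering, is to filter all counts first and only then take the most common ones. `FreqDist` is a subclass of `collections.Counter`, so this sketch uses a plain `Counter` to stay runnable without NLTK. The names `keep_bigram` and `top_filtered` are hypothetical, and the count for `('#', 'twitter')` is illustrative only; the other counts are copied from the question's output.

```python
from collections import Counter


def keep_bigram(bigram):
    """Same rule as in the answer: letters only, or '#'/'@' plus letters."""
    first, second = bigram
    return (first.isalpha() and second.isalpha()) or (
        first in ("#", "@") and second.isalpha()
    )


def top_filtered(counts, n):
    """Filter every bigram first, then return the n most frequent survivors."""
    filtered = Counter({bg: c for bg, c in counts.items() if keep_bigram(bg)})
    return filtered.most_common(n)


counts = Counter({
    ("https", ":"): 1615,
    ("!", "#"): 445,
    ("Thank", "you"): 386,
    (".", "I"): 354,
    ("#", "twitter"): 3,  # illustrative count, not from the question
})
top = top_filtered(counts, 2)
print(top)
```

With a real `FreqDist`, the same function could be called as `top_filtered(fdist, 100)`, since it only relies on `items()`.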