NLTK: Naive Bayes - where/how do I add ngrams?

Asked: 2015-03-26 23:46:27

Tags: python nlp classification nltk sentiment-analysis

I'm working on a tweet classification task (3 labels = pos, neg, neutral) and I'm using Naive Bayes in NLTK. I'd also like to add ngrams (bigrams). I've tried adding them to the code, but I can't seem to figure out where they fit: it looks like I'm "breaking" the code no matter where I add the bigrams. Could someone help me out, or point me to a tutorial?

My unigram code is below. If you need any information about what the dataset looks like, I'd be happy to provide it.

import nltk
import csv
import random 
import nltk.classify.util, nltk.metrics
import codecs
import re, math, collections, itertools
from nltk.corpus import stopwords
from nltk.classify import NaiveBayesClassifier
from nltk.probability import FreqDist, ConditionalFreqDist 
from nltk.util import ngrams
from nltk import bigrams
from nltk.metrics import BigramAssocMeasures
from nltk.collocations import BigramCollocationFinder
from nltk.tokenize import word_tokenize
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import WordPunctTokenizer

tokenizer = WordPunctTokenizer()
stemmer = SnowballStemmer("english", ignore_stopwords=True)
stopset = set(stopwords.words('english'))

stopset.add('username')
stopset.add('url')
stopset.add('percentage')
stopset.add('number')
stopset.add('at_user')
stopset.add('AT_USER')
stopset.add('URL')
stopset.add('percentagenumber')


inpTweets = []
##with open('sanders.csv', 'r', 'utf-8') as f:   #input sanders    
##    reader = csv.reader(f, delimiter = ';')    
##    for row in reader: 
##        inpTweets.append((row))
reader = codecs.open('...sanders.csv', 'r', encoding='utf-8-sig') #input classified tweets
for line in reader:
    line = line.rstrip()
    row = line.split(';')
    inpTweets.append(row)

def processTweet(tweet):
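    # normalize a raw tweet: lowercase, URL/AT_USER placeholders, collapse whitespace, drop '#'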
    tweet = tweet.lower()
    tweet = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', 'URL', tweet)
    tweet = re.sub(r'@[^\s]+', 'AT_USER', tweet)
    tweet = re.sub(r'[\s]+', ' ', tweet)
    tweet = re.sub(r'#([^\s]+)', r'\1', tweet)
    tweet = tweet.strip('\'"')
    return tweet

def replaceTwoOrMore(s):
    #look for 2 or more repetitions of character and replace with the character itself
    pattern = re.compile(r"(.)\1{1,}", re.DOTALL)
    return pattern.sub(r"\1\1", s)


def preprocessing(doc): 
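    # tokenize, then keep stemmed alphanumeric tokens longer than 2 chars that are not stopwords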
    tokens = tokenizer.tokenize(doc)
    bla = []
    for x in tokens:
        if len(x)>2:
            if x not in stopset:
                val = re.search(r"^[a-zA-Z][a-zA-Z0-9]*$", x)
                if val is not None:
                    x = replaceTwoOrMore(x)
                    x = processTweet(x)
                    x = x.strip('\'"?,.')
                    x = stemmer.stem(x).lower()
                    bla.append(x)
    return bla

xyz = []

for lijn in inpTweets:
    xyz.append((preprocessing(lijn[0]), lijn[1]))  # (token list, label)
random.shuffle(xyz)

featureList = []
for tokens, label in xyz:
    featureList.extend(tokens)

fd = nltk.FreqDist(featureList)
featureList = list(fd.keys())[2000:]

def document_features(doc):    
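    # one boolean feature per word in featureList: does the document contain it?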
    features = {}
    document_words = set(doc)
    for word in featureList:
        features['contains(%s)' % word] = (word in document_words)
    return features


featuresets = nltk.classify.util.apply_features(document_features, xyz)

training_set, test_set = featuresets[2000:], featuresets[:2000]

classifier = nltk.NaiveBayesClassifier.train(training_set)

2 Answers:

Answer 0 (score: 0)

Your code uses the 2000 most common words as classification features. Just pick the bigrams you want to use, and turn them into features in document_features(). A feature like "contains(the dog)" works exactly like "contains(dog)".
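A minimal sketch of that idea, reusing xyz and featureList from the question (the names allBigrams, bigramList and document_features_with_bigrams are illustrative assumptions, not part of the original code):

import nltk

# collect adjacent-token pairs from the same preprocessed tweets
allBigrams = []
for tokens, label in xyz:
    allBigrams.extend(nltk.bigrams(tokens))

# keep, say, the 500 most frequent bigrams as extra features (adjustable)
bigramFd = nltk.FreqDist(allBigrams)
bigramList = [bg for bg, count in bigramFd.most_common(500)]

def document_features_with_bigrams(doc):
    features = {}
    document_words = set(doc)
    document_bigrams = set(nltk.bigrams(doc))
    for word in featureList:
        features['contains(%s)' % word] = (word in document_words)
    for bg in bigramList:
        # e.g. 'contains(the dog)', used exactly like a unigram feature
        features['contains(%s %s)' % bg] = (bg in document_bigrams)
    return features

featuresets = nltk.classify.util.apply_features(document_features_with_bigrams, xyz)

Picking bigrams by raw frequency is the simplest choice; the BigramCollocationFinder and BigramAssocMeasures already imported in the question could rank them by association score instead.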

Answer 1 (score: -1)

An interesting approach would be to use a sequential backoff tagger, which lets you chain taggers together: that way you could train an n-gram tagger and a Naive Bayes classifier and chain them.
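A minimal sketch of that chaining, using the Brown corpus as stand-in training data (the corpus choice, tagger lineup and variable names are illustrative assumptions, not from this answer; the corpus may need nltk.download('brown') first):

import nltk
from nltk.corpus import brown
from nltk.tag.sequential import ClassifierBasedPOSTagger

train_sents = brown.tagged_sents(categories='news')[:4000]

# each tagger backs off to the previous one when it has no answer
t0 = nltk.DefaultTagger('NN')                      # last resort: tag everything 'NN'
t1 = nltk.UnigramTagger(train_sents, backoff=t0)   # unigram context
t2 = nltk.BigramTagger(train_sents, backoff=t1)    # bigram context
# ClassifierBasedPOSTagger trains a Naive Bayes classifier by default
t3 = ClassifierBasedPOSTagger(train=train_sents, backoff=t2)

print(t3.tag(['the', 'dog', 'barked']))

Note that this chains part-of-speech taggers, not the pos/neg/neutral document classifier from the question, so it would take extra work to adapt it to the sentiment task.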