Thanks in advance for your help; I'm very lost. I'm trying to import a corpus and then print trigrams to a CSV file, with each trigram's frequency count and relative frequency in two columns next to a first column containing the whole trigram. But I can't quite work out the RegexpTokenizer. The code below is about 90% of the way there, but the RegexpTokenizer only matches word characters, so it splits apart phrases that use contractions such as "don't leave", and a trigram like "don't go" comes out as "don t go".
I need it to stop doing that. Without the RegexpTokenizer the trigrams come out looking like (u'...', u'...', u'...'), so I figured you could use the RegexpTokenizer to pull out the phrases between each u' and ', but I don't know how to do that.
import nltk
import re
from nltk.corpus.reader.plaintext import PlaintextCorpusReader
from nltk import FreqDist
import math
from decimal import Decimal
from nltk.tokenize import RegexpTokenizer, WhitespaceTokenizer
import csv
# Import every .txt file in this folder into a corpus called speeches
corpus_root = '/Users/root...'
speeches = PlaintextCorpusReader(corpus_root, r'.*\.txt')
print "Finished importing corpus"
tokenizer = RegexpTokenizer(r'\w+')           # matches runs of word characters only
raw = speeches.raw().lower()
tokens = tokenizer.tokenize(raw)
tgs = nltk.trigrams(tokens)                   # all trigrams over the token stream
fdist = nltk.FreqDist(tgs)                    # frequency of each trigram
minscore = 200                                # only report trigrams seen more than this many times
numwords = len(tokens)                        # total token count, the denominator for relative frequency
c = csv.writer(open("TPNngrams.csv", "wb"))   # "wb" because this is Python 2's csv module
for k, v in fdist.items():
    if v > minscore:
        rf = Decimal(v) / Decimal(numwords)   # relative frequency of this trigram
        firstword, secondword, thirdword = k
        trigram = firstword + " " + secondword + " " + thirdword
        results = trigram, v, rf
        c.writerow(results)
        print firstword, secondword, thirdword, v, rf
I also get this error at random:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 0: ordinal not in range(128)
Answer (score 0):
To fix the regexp tokenizer, replace your tokenizer with one whose pattern also allows apostrophes inside tokens:
text = "We have 15 billion dollars in gold in our treasury; we don't own an ounce."
tokenizer = RegexpTokenizer(r"[\w']+")  # word characters plus apostrophes, so "don't" stays one token
tokens = tokenizer.tokenize(text)
# ['We', 'have', '15', 'billion', 'dollars', 'in', 'gold', 'in', 'our', 'treasury', 'we', "don't", 'own', 'an', 'ounce']
This handles the contractions.
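As a quick sanity check (just a sketch, reusing nltk and RegexpTokenizer from your own imports), the contraction now survives all the way into the trigrams:

import nltk
from nltk.tokenize import RegexpTokenizer
# the apostrophe-friendly pattern keeps "don't" as a single token
tokens = RegexpTokenizer(r"[\w']+").tokenize("we don't own an ounce")
print list(nltk.trigrams(tokens))
# [('we', "don't", 'own'), ("don't", 'own', 'an'), ('own', 'an', 'ounce')]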
I'm not sure where the error is happening (maybe post more information?), but my guess is that you are importing odd characters that Python doesn't know how to handle. Try adding
# -*- coding: utf8 -*-
at the very top of your .py file.
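If that alone doesn't make the error go away, the u'\xa9' (a © sign) hints that the failure may happen when a unicode trigram reaches Python 2's csv writer or print, both of which can fall back to the ascii codec. One common workaround, sketched here against the names already used in your loop, is to encode the text to UTF-8 bytes just before writing it:

# Inside the loop from the question: encode the unicode trigram to UTF-8
# bytes so the csv writer and print never try the ascii codec on it.
trigram = u" ".join(k)
c.writerow((trigram.encode("utf-8"), v, rf))
print trigram.encode("utf-8"), v, rf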