Help me figure out what is wrong with my Python code.
Here is the code:
import nltk
import re
import pickle
raw = open('tom_sawyer_shrt.txt').read()
### this is how the basic Punkt sentence tokenizer works
#sent_tokenizer=nltk.data.load('tokenizers/punkt/english.pickle')
#sents = sent_tokenizer.tokenize(raw)
### train & tokenize text using text
sent_trainer = nltk.tokenize.punkt.PunktSentenceTokenizer().train(raw)
sent_tokenizer = nltk.tokenize.punkt.PunktSentenceTokenizer(sent_trainer)
# break in to sentences
sents = sent_tokenizer.tokenize(raw)
# get sentence start/stop indexes
sentspan = sent_tokenizer.span_tokenize(raw)
### Remove \n in the middle of sentences, due to fixed-width formatting
for i in range(0,len(sents)-1):
    sents[i] = re.sub('(?<!\n)\n(?!\n)',' ',raw[sentspan[i][0]:sentspan[i+1][0]])
for i in range(1,len(sents)):
    if (sents[i][0:3] == '"\n\n'):
        sents[i-1] = sents[i-1]+'"\n\n'
        sents[i] = sents[i][3:]
### Loop thru each sentence, fix to 140char
i=0
tweet=[]
while (i<len(sents)):
    if (len(sents[i]) > 140):
        ntwt = int(len(sents[i])/140) + 1
        words = sents[i].split(' ')
        nwords = len(words)
        for k in range(0,ntwt):
            tweet = tweet + [
                re.sub('\A\s|\s\Z', '', ' '.join(
                words[int(k*nwords/float(ntwt)):
                      int((k+1)*nwords/float(ntwt))]
                ))]
        i=i+1
    else:
        if (i<len(sents)-1):
            if (len(sents[i])+len(sents[i+1]) <140):
                nextra = 1
                while (len(''.join(sents[i:i+nextra+1]))<140):
                    nextra=nextra+1
                tweet = tweet+[
                    re.sub('\A\s|\s\Z', '',''.join(sents[i:i+nextra]))
                    ]
                i = i+nextra
            else:
                tweet = tweet+[re.sub('\A\s|\s\Z', '',sents[i])]
                i=i+1
        else:
            tweet = tweet+[re.sub('\A\s|\s\Z', '',sents[i])]
            i=i+1
### A last pass to clean up leading/trailing newlines/spaces.
for i in range(0,len(tweet)):
    tweet[i] = re.sub('\A\s|\s\Z','',tweet[i])
for i in range(0,len(tweet)):
    tweet[i] = re.sub('\A"\n\n','',tweet[i])
### Save tweets to pickle file for easy reading later
output = open('tweet_list.pkl','wb')
pickle.dump(tweet,output,-1)
output.close()
listout = open('tweet_lis.txt','w')
for i in range(0,len(tweet)):
    listout.write(tweet[i])
    listout.write('\n-----------------\n')
listout.close()
This is the error message:
Traceback (most recent call last):
  File "twain_prep.py", line 13, in <module>
    sent_trainer = nltk.tokenize.punkt.PunktSentenceTokenizer().train(raw)
  File "/home/user/.local/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1227, in train
    token_cls=self._Token).get_params()
  File "/home/user/.local/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 649, in __init__
    self.train(train_text, verbose, finalize=True)
  File "/home/user/.local/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 713, in train
    self._train_tokens(self._tokenize_words(text), verbose)
  File "/home/user/.local/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 729, in _train_tokens
    tokens = list(tokens)
  File "/home/user/.local/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 542, in _tokenize_words
    for line in plaintext.split('\n'):
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 6: ordinal not in range(128)
Answer 0 (score: 1)
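You are getting a UnicodeDecodeError. In Python 2, a plain string (str) is a sequence of bytes, and whenever such a string is mixed with unicode text, Python decodes it implicitly with the ascii codec. That is why the tokenizer fails: the text you pass to it must contain at least one byte outside the ASCII range, and the traceback names it as 0xe2 at position 6.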
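To make the failure concrete, here is a small sketch of where a 0xe2 byte can come from. It assumes, without having seen the file, that tom_sawyer_shrt.txt is UTF-8 text containing typographic punctuation such as a curly apostrophe; the helper name decodes_as_ascii is made up for illustration.

```python
# Assumption: the offending byte belongs to a UTF-8 encoded punctuation mark.
# U+2019 (right single quotation mark) is one common example.
curly_quote_bytes = u'\u2019'.encode('utf-8')

# The first byte of that UTF-8 sequence is 0xe2, the byte named in the traceback.
first_byte = bytearray(curly_quote_bytes)[0]
print(hex(first_byte))


def decodes_as_ascii(data):
    """Return True if the byte string contains only ASCII bytes."""
    try:
        data.decode('ascii')
        return True
    except UnicodeDecodeError:
        return False


# Plain ASCII bytes decode fine; the curly quote bytes do not.
print(decodes_as_ascii(b'Tom Sawyer'))
print(decodes_as_ascii(curly_quote_bytes))
```

Running the same check on the first few bytes of your own file should show whether this is what is happening in your case.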
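So how do you fix it?
You can convert your text to ASCII characters and ignore the non-ASCII ones. Because raw is a byte string here, decode it first (this assumes the file is UTF-8); calling encode directly on a byte string triggers the same implicit ASCII decode and raises the same error.
raw = raw.decode('utf-8').encode('ascii', 'ignore')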
You can also read this post on handling Unicode errors.
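An alternative is to decode the file when you read it, so that raw is already unicode text before it reaches the tokenizer. The sketch below assumes the file is UTF-8; read_text and to_ascii are hypothetical helper names, and the sample file is created only for the demonstration. io.open accepts an encoding argument on both Python 2.7 and Python 3.

```python
import io
import os
import tempfile


def read_text(path, encoding='utf-8'):
    """Read a file and return decoded (unicode) text rather than raw bytes."""
    with io.open(path, encoding=encoding) as handle:
        return handle.read()


def to_ascii(text):
    """Drop every non-ASCII character from already-decoded text."""
    return text.encode('ascii', 'ignore').decode('ascii')


# Demo on a throwaway file that contains curly quotes (assumed UTF-8).
sample = u'\u201cTom!\u201d No answer.'
sample_path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with io.open(sample_path, 'w', encoding='utf-8') as handle:
    handle.write(sample)

raw = read_text(sample_path)   # unicode text, curly quotes preserved
ascii_raw = to_ascii(raw)      # same text with the curly quotes removed
print(repr(ascii_raw))
```

In your script this would mean replacing open('tom_sawyer_shrt.txt').read() with read_text('tom_sawyer_shrt.txt'). Passing the decoded text to the tokenizer should avoid the implicit ASCII decode and keeps the quotation marks; use to_ascii only if you really want them stripped. If you keep unicode text, the output file at the end of the script also needs an explicit encoding when it is written.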