I am currently using praw to pull comments from various Reddit subreddits, compute their sentiment, and add it to a database. It works by reading from a file containing subreddit names, so it knows which subreddits to pull comments from.
import praw
from praw.models import MoreComments
from sentiment_analysis import getSentiment  # per the traceback, getSentiment lives in sentiment_analysis.py

# assumed setup, not shown in the original excerpt:
# reddit = praw.Reddit(...)  # authenticated praw.Reddit instance
subreddit_sentiment = 0
num_comments = 0

with open('subs.txt') as f:
    for line in f:
        string = line.strip()  # one subreddit name per line
        for submission in reddit.subreddit(string).hot(limit=10):
            subreddit = reddit.subreddit(line.strip())
            name = str(subreddit.display_name)
            comments = submission.comments.list()
            for c in comments:
                if isinstance(c, MoreComments):
                    continue
                # print c.body
                author = c.author
                score = c.score
                created_at = c.created_utc
                upvotes = c.ups
                # print c.score
                comment_sentiment = getSentiment(c.body)
                subreddit_sentiment += comment_sentiment
                num_comments += 1
My current implementation works fine until it reaches a certain comment, at which point it throws the following error:
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 8-10: unexpected end of data
I have looked at many questions here from people running into the same problem, but the solutions given don't seem to help me fix it.
The full stack trace is below:
Traceback (most recent call last):
  File "extract.py", line 48, in <module>
    comment_sentiment = getSentiment(c.body)
  File "/Users/b38/Desktop/FlaskApp/sentiment_analysis.py", line 93, in getSentiment
    tagged_sentences = makeTag(pos_tag_text, max_key_size, dictionary)
  File "/Users/b38/Desktop/FlaskApp/sentiment_analysis.py", line 106, in makeTag
    return [addTag(sentence, max_key_size, dictionary) for sentence in postagged_sentences]
  File "/Users/b38/Desktop/FlaskApp/sentiment_analysis.py", line 119, in addTag
    expression_word = ' '.join([word[0] for word in sentence[i:j]]).lower().encode('utf-8',errors='ignore')
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 8-10: unexpected end of data
I have been racking my brain trying to come up with ways to solve this, and unfortunately I am lost. Is it related to reading from the file containing the subreddits, or is it some limitation of pulling data with praw? I have tried to track down the problem but can't seem to get past this error.
Can anyone help me solve this? I would appreciate any insight. Many thanks.
Edit: sentiment_analysis.py
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import sys
reload(sys)
sys.setdefaultencoding('utf8')
import pandas as pd
import nltk
import yaml
import os
import re

# splitting the text initially
def splitString(text):
    nltk_splitter = nltk.data.load('tokenizers/punkt/english.pickle')
    nltk_tokenizer = nltk.tokenize.TreebankWordTokenizer()
    sentences = nltk_splitter.tokenize(text)
    tokenized_sentences = [nltk_tokenizer.tokenize(sentence) for sentence in sentences]
    return tokenized_sentences

def tagWords(sentence, max_key_size, dictionary, tag_stem=False):
    # Tag all possible sentences
    tagged_sentence = []
    length = len(sentence)
    if max_key_size == 0:
        max_key_size = length
    i = 0
    while (i < length):
        j = min(i + max_key_size, length)
        tagged = False
        while (j > i):
            expression_word = ' '.join([word[0] for word in sentence[i:j]]).lower().encode('utf-8', errors='ignore')  # here is where it gets caught
            expression_stem = ' '.join([word[1] for word in sentence[i:j]]).lower().encode('utf-8', errors='ignore')
            if tag_stem == True:
                word = expression_word
            else:
                word = expression_word
            ....
Answer 0 (score: 0)
Try encoding the string explicitly:
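The answer's code snippet did not survive, so what follows is only a minimal sketch of what "encoding the string explicitly" could look like, assuming the intent is to convert the comment body to unicode before the .encode('utf-8') calls in sentiment_analysis.py run. In Python 2, calling .encode('utf-8') on a byte string first decodes it implicitly with the default encoding, and that implicit decode is what raises the UnicodeDecodeError when a comment contains a truncated multi-byte sequence. The helper name to_unicode below is made up for illustration.

# Hypothetical helper (not from the original answer): normalise input to unicode
# so later .encode('utf-8') calls never have to implicitly decode raw bytes.
def to_unicode(text, encoding='utf-8'):
    if isinstance(text, bytes):
        # decode explicitly, replacing malformed or truncated byte sequences
        return text.decode(encoding, errors='replace')
    return text

# in extract.py, before the sentiment call:
comment_sentiment = getSentiment(to_unicode(c.body))

Using errors='replace' (or 'ignore') on the explicit decode substitutes or drops the offending bytes instead of aborting, which is usually acceptable for sentiment scoring.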