I'm trying to learn how to do sentiment analysis for NLP (on Python 3), and I'm using the "UMICH SI650 - Sentiment Classification" dataset available on Kaggle: https://www.kaggle.com/c/si650winter11
At the moment I'm trying to generate the vocabulary with a few loops; here is the code:
import collections
import nltk
import os

Directory = "../Databases"

# Read training data and generate vocabulary
max_length = 0
freqs = collections.Counter()
num_recs = 0
training = open(os.path.join(Directory, "train_sentiment.txt"), 'rb')
for line in training:
    if not line:
        continue
    label, sentence = line.strip().split("\t".encode())
    words = nltk.word_tokenize(sentence.decode("utf-8", "ignore").lower())
    if len(words) > max_length:
        max_length = len(words)
    for word in words:
        freqs[word] += 1
    num_recs += 1
training.close()
I keep getting this error message, which I don't fully understand:
    label, sentence = line.strip().split("\t".encode())
ValueError: not enough values to unpack (expected 2, got 1)
I tried adding

if not line:
    continue

as suggested here: ValueError : not enough values to unpack. why? But that doesn't work in my case. How can I fix this error?
Thank you very much.
Answer 0 (score: 1)
The easiest way to resolve this is to put the unpacking statement inside a try/except block.
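A minimal sketch of what that could look like (assuming you simply want to report and skip any malformed line; adjust the handling as needed):

for line in training:
    try:
        label, sentence = line.strip().split("\t".encode())
    except ValueError:
        # the line did not split into exactly two tab-separated fields
        print("Could not split this line:", line)
        continue
    # ... rest of the processing as before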
My guess is that some of your lines contain only whitespace after the label.
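To see why that breaks the unpacking: strip() also removes the tab when nothing but whitespace follows the label, so the split yields a single field (illustrative input):

>>> b"1\t  \n".strip().split("\t".encode())
[b'1']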
Answer 1 (score: 1)
Here is a cleaner way to read the dataset from https://www.kaggle.com/c/si650winter11

First, context managers are your friend, use them: http://book.pythontips.com/en/latest/context_managers.html

Second, if it's a text file, avoid reading it as binary, i.e. use open(filename, 'r') rather than open(filename, 'rb'); then there is no need to mess with str/bytes and encoding/decoding.

Now:
from nltk import word_tokenize
from collections import Counter

word_counts = Counter()
with open('training.txt', 'r') as fin:
    for line in fin:
        label, text = line.strip().split('\t')
        # Avoid lowercasing before tokenization;
        # lowercasing after tokenization is much better,
        # just in case the tokenizer uses capitalization as a cue.
        word_counts.update(map(str.lower, word_tokenize(text)))
print(word_counts)
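If you only want to peek at the vocabulary rather than dump the whole Counter, for example:

print(word_counts.most_common(10))  # the ten most frequent tokens
print(len(word_counts))             # vocabulary size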
Answer 2 (score: 0)
You should handle the case where a line has the wrong number of fields:
if not line:
    continue
fields = line.strip().split("\t".encode())
if len(fields) != 2:
    # you could print(fields) here to help debug
    continue
label, sentence = fields