ValueError: not enough values to unpack

Date: 2018-08-20 20:15:50

Tags: python-3.x dictionary nlp nltk sentiment-analysis

I am trying to learn how to do sentiment analysis for NLP (on Python 3), using the "UMICH SI650 - Sentiment Classification" dataset available on Kaggle: https://www.kaggle.com/c/si650winter11

Right now I am trying to generate the vocabulary with a few loops; here is the code:

    import collections
    import nltk
    import os

    Directory = "../Databases"


    # Read training data and generate vocabulary
    max_length = 0
    freqs = collections.Counter()
    num_recs = 0
    training = open(os.path.join(Directory, "train_sentiment.txt"), 'rb')
    for line in training:
        if not line:
            continue
        label, sentence = line.strip().split("\t".encode())
        words = nltk.word_tokenize(sentence.decode("utf-8", "ignore").lower())
        if len(words) > max_length:
            max_length = len(words)
        for word in words:
            freqs[word] += 1
        num_recs += 1
    training.close()

I keep getting this error message, which I don't fully understand:

    label, sentence = line.strip().split("\t".encode())
    ValueError: not enough values to unpack (expected 2, got 1)

I tried adding

    if not line:
        continue

as suggested here: ValueError : not enough values to unpack. why? But that didn't work in my case. How can I fix this error?

Thanks a lot

3 Answers:

Answer 0 (score: 1)

The simplest way to work around this is to put the unpacking statement inside a try/except block.
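A minimal sketch of what that could look like inside the loop (the printed message is just illustrative):

    try:
        label, sentence = line.strip().split("\t".encode())
    except ValueError:
        # a line without a tab yields a single field, so unpacking fails
        print("Skipping malformed line:", line)
        continue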

My guess is that some of your lines contain nothing but whitespace after the label.

Answer 1 (score: 1)

Here is a cleaner way of reading the dataset from https://www.kaggle.com/c/si650winter11

First, context managers are your friend; use them: http://book.pythontips.com/en/latest/context_managers.html

Second, if it's a text file, avoid reading it as binary: use open(filename, 'r') instead of open(filename, 'rb'), and then there is no need to mess with str vs. bytes or encode/decode.

Now:

    from nltk import word_tokenize
    from collections import Counter

    word_counts = Counter()
    with open('training.txt', 'r') as fin:
        for line in fin:
            label, text = line.strip().split('\t')
            # Avoid lowercasing before tokenization;
            # lowercasing after tokenization is much better,
            # just in case the tokenizer uses capitalization as a cue.
            word_counts.update(map(str.lower, word_tokenize(text)))

    print(word_counts)
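With word_counts filled in, the vocabulary statistics the question is after can be read straight off the Counter, for example:

    # vocabulary size and the ten most frequent tokens
    print(len(word_counts))
    print(word_counts.most_common(10))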

Answer 2 (score: 0)

You should handle the case where a line has the wrong number of fields:

    if not line:
        continue
    fields = line.strip().split("\t".encode())
    if len(fields) != 2:
        # you could print(fields) here to help debug
        continue
    label, sentence = fields
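For context, a sketch of the question's loop with that guard dropped in; it also adopts the text-mode and tokenize-then-lowercase suggestions from the answer above (the path and file name are the question's own):

    import collections
    import os

    import nltk

    Directory = "../Databases"

    max_length = 0
    freqs = collections.Counter()
    num_recs = 0
    with open(os.path.join(Directory, "train_sentiment.txt"), "r") as training:
        for line in training:
            fields = line.strip().split("\t")
            if len(fields) != 2:
                continue  # skip blank or malformed lines instead of raising
            label, sentence = fields
            words = [w.lower() for w in nltk.word_tokenize(sentence)]
            max_length = max(max_length, len(words))
            freqs.update(words)
            num_recs += 1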