I'm trying to learn how to do sentiment analysis for NLP (on Python 3), and I'm using the "UMICH SI650 - Sentiment Classification" dataset available on Kaggle: https://www.kaggle.com/c/si650winter11
At the moment I'm trying to generate the vocabulary with a few loops; here is the code:
import collections
import nltk
import os

Directory = "../Databases"

# Read training data and generate vocabulary
max_length = 0
freqs = collections.Counter()
num_recs = 0
training = open(os.path.join(Directory, "train_sentiment.txt"), 'rb')
for line in training:
    if not line:
        continue
    label, sentence = line.strip().split("\t".encode())
    words = nltk.word_tokenize(sentence.decode("utf-8", "ignore").lower())
    if len(words) > max_length:
        max_length = len(words)
    for word in words:
        freqs[word] += 1
    num_recs += 1
training.close()
I keep getting this error message, which I don't fully understand:
    label, sentence = line.strip().split("\t".encode())
ValueError: not enough values to unpack (expected 2, got 1)
I tried adding

if not line:
    continue

as suggested here: ValueError : not enough values to unpack. why? But that doesn't work in my case. How can I fix this error?
Thank you very much.
Answer 0 (score: 1)
The easiest way to resolve this is to put the unpacking statement inside a try/except block.
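A minimal sketch of what that could look like (assuming you simply want to report and skip any malformed line; adjust the handling as needed):

for line in training:
    try:
        label, sentence = line.strip().split("\t".encode())
    except ValueError:
        # the line did not split into exactly two tab-separated fields
        print("Could not split this line:", line)
        continue
    # ... rest of the processing as before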
My guess is that some of your lines contain only whitespace after the label.
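To see why that breaks the unpacking: strip() also removes the tab when nothing but whitespace follows the label, so the split yields a single field (illustrative input):

>>> b"1\t  \n".strip().split("\t".encode())
[b'1']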
Answer 1 (score: 1)
Here is a cleaner way to read the dataset from https://www.kaggle.com/c/si650winter11

First, context managers are your friend, use them: http://book.pythontips.com/en/latest/context_managers.html

Second, if it's a text file, avoid reading it as binary, i.e. use open(filename, 'r') rather than open(filename, 'rb'); then there is no need to mess with str/bytes and encoding/decoding.

Now:
from nltk import word_tokenize
from collections import Counter

word_counts = Counter()
with open('training.txt', 'r') as fin:
    for line in fin:
        label, text = line.strip().split('\t')
        # Avoid lowercasing before tokenization;
        # lowercasing after tokenization is much better,
        # just in case the tokenizer uses capitalization as a cue.
        word_counts.update(map(str.lower, word_tokenize(text)))
print(word_counts)
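If you only want to peek at the vocabulary rather than dump the whole Counter, for example:

print(word_counts.most_common(10))  # the ten most frequent tokens
print(len(word_counts))             # vocabulary size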
Answer 2 (score: 0)
You should handle the case where a line has the wrong number of fields:
if not line:
    continue
fields = line.strip().split("\t".encode())
if len(fields) != 2:
    # you could print(fields) here to help debug
    continue
label, sentence = fields