Question

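Goal:
I am trying to use keras-text to classify text strings into 5 different classes. I am working with article titles, which are used to assign a conference to each article. The problem occurs during preprocessing: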

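I am using Dataset() from keras_text.data, which takes X (the input data), y (the labels belonging to the data) and a tokenizer as input. I use the standard WordTokenizer. For X I use a NumPy array of shape (21643, 1); for y I use a NumPy array of shape (21643,).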

When I call .update_test_indices(test_size=0.1), I get the following error:

ValueError: Found input variables with inconsistent numbers of samples: [21643, 108215]

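Each input string can have 1 of 5 labels. As you can see, the number of inputs and the number of samples are inconsistent: the sample count is exactly 5 times the input count. If I change the data so that only 1 of 4 labels is assigned, the shape changes to (86572,), which is exactly 4 times the input count. So the labels seem to be multiplied by the number of classes.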

When I print the number of classes with Dataset.num_classes, only 1 class seems to be recognized, and I don't know why.

What input format does Dataset() expect, and how should update_test_indices() be used?

My full code:

from keras_text.processing import WordTokenizer
from keras_text.data import Dataset
import numpy as np

def readFile(file):
    X = []
    y = []
    with open(file) as f:
        for line in f:
            nr, label, input = line.rstrip().split('\t')
            X.append(input)
            #y.append(label)
            if label == 'ISCAS':
                y.append(1)
                continue
            if label == 'SIGGRAPH':
                y.append(2)
                continue
            if label == 'WWW':
                y.append(3)
                continue
            if label == 'INFOCOM':
                y.append(4)
                continue
            if label == 'VLDB':
                y.append(5)
            else:
                print('wrong',label)
    npX = np.asarray(X)
    npX = npX.reshape(len(X), 1)
    npy = np.asarray(y)
    npy = npy.reshape(len(X))
    return npX, npy

X, y = readFile('Trainset.txt')
print(X.shape)
print(y.shape)

tokenizer = WordTokenizer()
ds = Dataset(X, y, tokenizer=tokenizer)
ds.update_test_indices(test_size=0.1)
print(ds.num_classes)
ds.save('dataset')

This is the output I get:

Using TensorFlow backend.
(21643, 1)
(21643,)
Traceback (most recent call last):
  File "test.py", line 42, in <module>
    ds.update_test_indices(test_size=0.1)
  File "..\site-packages\keras_text\data.py", line 54, in update_test_indices
    self._train_indices, self._test_indices = next(sss.split(self.X, self.y))
  File "..\site-packages\sklearn\model_selection\_split.py", line 1203, in split
    X, y, groups = indexable(X, y, groups)
  File "..\site-packages\sklearn\utils\validation.py", line 229, in indexable
    check_consistent_length(*result)
  File "..\site-packages\sklearn\utils\validation.py", line 204, in check_consistent_length
    " samples: %r" % [int(l) for l in lengths])
ValueError: Found input variables with inconsistent numbers of samples: [21643, 86572]

Process finished with exit code 1

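Update: I found that the problem only occurs in the multi-class case. If I change the labels so that there are only two classes, it works. However, I don't know how to make it work with more than two classes. Can anyone help?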

Answer 1

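The problem seems to be that the labels are encoded as one-hot vectors in the constructor of Dataset(), but the split does not take this into account. Flattening an (N, k) one-hot matrix yields N * k entries, which would explain the 5x and 4x lengths reported in the question. Commenting out .flatten() on line 42 of data.py seems to fix the problem; I have not yet trained a full model with that change.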
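The effect can be reproduced with NumPy alone. The following is only a minimal sketch, not the keras-text code itself: the toy labels, the class count and all variable names are made up for illustration, and it assumes the encoding step behaves like "one-hot encode, then flatten" as described above. The last part assumes that two classes are encoded as a single 0/1 column (as scikit-learn's LabelBinarizer does), which would be consistent with the two-class case working.

```python
import numpy as np

# Hypothetical toy data: 6 samples with class ids 0..4 (not the asker's data).
labels = np.array([0, 1, 2, 3, 4, 2])
num_classes = 5

# One-hot encode: one row per sample, one column per class.
one_hot = np.eye(num_classes, dtype=int)[labels]  # shape (6, 5)

# Flattening turns the matrix into one long vector of 6 * 5 entries,
# so its length no longer matches the number of samples.
flat = one_hot.flatten()
print(one_hot.shape, flat.shape)

# Keeping the matrix 2-D, or reducing it to one class id per row,
# gives one entry per sample again.
per_sample = one_hot.argmax(axis=1)
print(len(one_hot), per_sample.shape)

# Assumption: two classes encoded as a single 0/1 column.
# Flattening such a column leaves the length unchanged.
binary = np.array([[0], [1], [1], [0]])
print(binary.flatten().shape)
```

A stratified splitter needs a label array with the same number of rows as X, so either the unflattened matrix or a per-sample vector such as `per_sample` keeps the lengths consistent.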
