How to split generator data into train and test without converting it to dense data?

Date: 2017-12-03 16:33:26

Tags: tensorflow scipy scikit-learn

I would like to split generator data into training and test sets without converting it to dense data, in order to reduce RAM consumption.

import operator
import random

ops = {'+': operator.add,
       '-': operator.sub}  # add mul and div if you wish

keys_tuple = tuple(ops.keys())

while True:

    num_a = random.randint(1, 10)  # use larger range if you wish
    num_b = random.randint(1, 10)  # use larger range if you wish
    op = random.choice(keys_tuple)

    print('{}{}{}=?'.format(num_a, op, num_b))
    expected_answer = ops[op](num_a, num_b)
    user_answer = int(input())
    if user_answer == expected_answer:
        print('Correct')
    else:
        print('Wrong')

import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split

# Data set
ds = np.array([
    ('Alice', 0),
    ('Bob', 1),
    ('Charlie', 1),
])
x = ds[:, 0]
y = ds[:, 1]

# Change texts into numeric vectors
max_sequence = max(x, key=len)
vocab_processor = tf.contrib.learn.preprocessing.VocabularyProcessor(len(max_sequence))
text_processed = vocab_processor.fit_transform(x)
print(type(text_processed))  # <class 'generator'>

# Split into training and test
x_train, \
x_test, \
y_train, \
y_test = train_test_split(text_processed, y)

However, train_test_split complains:

TypeError: Singleton array array(<generator object VocabularyProcessor.transform at 0x116f6f830>, dtype=object) cannot be considered a valid collection
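The TypeError comes from train_test_split itself: it needs inputs that have a length and support indexing, and a Python generator has neither. As a minimal sketch of the baseline workaround (this materializes the generator into exactly the dense array the question is trying to avoid, shown only to make the failure mode concrete):

x_dense = np.array(list(text_processed))  # consumes the generator; fully dense in RAM
x_train, x_test, y_train, y_test = train_test_split(x_dense, y)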

Questions

  • How can I split text_processed into sparse data? (a sketch addressing this follows below)
  • Is it worth trying CountVectorizer instead of VocabularyProcessor?
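For the first bullet, one possible direction (a sketch, not from the original post; it assumes each item yielded by text_processed is a fixed-length 1-D array of vocabulary ids, which is what VocabularyProcessor.transform produces): stream the generator into per-row scipy sparse matrices and stack them, so only one row is dense at any moment, then pass the stacked sparse matrix to train_test_split, which accepts scipy sparse inputs.

from scipy import sparse

# Densify one row at a time; the stacked result stays sparse,
# so padding zeros from VocabularyProcessor are not stored.
sparse_rows = [sparse.csr_matrix(row) for row in text_processed]
x_sparse = sparse.vstack(sparse_rows, format='csr')

# train_test_split can index scipy sparse matrices without a dense copy.
x_train, x_test, y_train, y_test = train_test_split(x_sparse, y)

Note that the ids VocabularyProcessor assigns are mostly non-zero for real tokens, so the sparsity here comes only from padding; the memory savings are real only when documents are much shorter than the padded length.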

Context

Suppose I am trying this spam/ham tutorial with more data and longer texts.
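For the second bullet, a sketch of the CountVectorizer route (an alternative representation, not the tutorial's approach: it produces bag-of-words counts rather than the padded id sequences a sequence model expects). CountVectorizer.fit_transform returns a scipy.sparse CSR matrix directly, so there is no generator to split and nothing is ever densified.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

texts = ['Alice', 'Bob', 'Charlie']  # stand-in for the real corpus
labels = [0, 1, 1]

vectorizer = CountVectorizer()
x_counts = vectorizer.fit_transform(texts)  # scipy.sparse CSR, never dense
x_train, x_test, y_train, y_test = train_test_split(x_counts, labels)

Whether this is worth it depends on the model: word-count features suit naive Bayes or logistic regression, but not an embedding-based network that consumes token id sequences.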

0 answers:

No answers