Error when splitting a dataset with train_test_split

Time: 2018-11-16 04:47:28

Tags: python

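I am trying to fit a classifier on JSON files that I split with train_test_split, but I get an error when I do. When I split the data another way, it works fine. Why does this happen?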

Code:

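import json
import os

from sklearn.model_selection import train_test_split

total_files = []
folders = os.listdir("files")
for fs in folders:
    files = os.listdir("files/{}".format(fs))
    for i, n in enumerate(files):
        # n is the file name inside the folder fs
        with open("files/{}/{}".format(fs, n), "r") as f:
            load = json.load(f)
            total_files.append(load)

# X and y are both bound to the same list of loaded JSON objects
X = y = total_files
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)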

My X_train, X_test, y_train, and y_test are all JSON data shaped like [List[List[Dict]]].

Then I use a classifier:

classifier.fit(X_train, y_train)

Then I get this error:

ValueError: You appear to be using a legacy multi-label data representation. Sequence of sequences are no longer supported; use a binary array or sparse matrix instead.

When I split the data another way:

folders = os.listdir("files")
for fs in folders:
    X_train = []
    y_train = []
    X_test = []
    y_test = []
    files = os.listdir("files/{}".format(fs))
    num = len(files)
    th = num * 0.8
    for i, n in enumerate(files):
        with open("files/{}/{}".format(fs, n), "r") as f:
            load = json.load(f)
            if i < th:
                X_train.append(load)
                y_train.append(fs)
            else:
                X_test.append(load)
                y_test.append(fs)

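This works perfectly, and I don't understand what the difference between the two approaches is. Printing X_train gives similar output in both cases.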
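
One visible difference between the two snippets: in the first, `X = y = total_files` binds both names to the same list, so `y_train` holds nested JSON lists rather than labels, which matches the sequence-of-sequences shape the error message describes. The second snippet uses the folder name `fs` as the label. Below is a minimal sketch that builds the labels the same way while still allowing a single call to `train_test_split`. It assumes the folder name is the intended class label, and `load_dataset` is a hypothetical helper, not part of scikit-learn.

```python
import json
import os


def load_dataset(root):
    """Load every JSON file under root/<label>/ and return (samples, labels).

    Assumption: the sub-folder name is the class label, as in the
    second snippet above.
    """
    samples, labels = [], []
    for label in sorted(os.listdir(root)):
        folder = os.path.join(root, label)
        if not os.path.isdir(folder):
            continue  # skip stray files next to the class folders
        for name in sorted(os.listdir(folder)):
            with open(os.path.join(folder, name), "r") as f:
                samples.append(json.load(f))
            labels.append(label)  # one flat label per loaded sample
    return samples, labels
```

With `X, y = load_dataset("files")`, calling `train_test_split(X, y, test_size=0.2)` should then return flat string labels in `y_train` and `y_test` instead of copies of the nested JSON.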

0 answers:

No answers yet