I'm trying to use train_test_split to split my JSON files, but I get an error when I do. When I split the data another way, it works fine. I don't understand why.
Code:
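import json
import os

from sklearn.model_selection import train_test_split

total_files = []
folders = os.listdir("files")
for fs in folders:
    files = os.listdir(os.path.join("files", fs))
    for fname in files:
        # Load each JSON file found in the current folder
        with open(os.path.join("files", fs, fname), "r") as f:
            load = json.load(f)
        total_files.append(load)
# X and y both refer to the same list of loaded JSON objects
X = y = total_files
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)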
My X_train, X_test, y_train, and y_test are all JSON data with a structure like [List[List[Dict]]].
Then I fit a classifier:
classifier.fit(X_train, y_train)
and I get this error:
ValueError: You appear to be using a legacy multi-label data representation. Sequence of sequences are no longer supported; use a binary array or sparse matrix instead.
When I split the data a different way:
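import json
import os

# Initialized once, before the loop, so samples from every folder are kept
X_train, y_train = [], []
X_test, y_test = [], []
folders = os.listdir("files")
for fs in folders:
    files = os.listdir(os.path.join("files", fs))
    num = len(files)
    th = num * 0.8  # the first 80% of each folder goes to the training set
    for i, fname in enumerate(files):
        with open(os.path.join("files", fs, fname), "r") as f:
            load = json.load(f)
        if i < th:
            X_train.append(load)
            y_train.append(fs)  # the folder name is the label
        else:
            X_test.append(load)
            y_test.append(fs)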
This works perfectly, and I don't know what the difference between the two approaches is. Printing X_train gives similar output in both cases.
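One difference that is visible in the two snippets: the first sets y to the very same nested lists as X (X = y = total_files), while the second uses the folder name fs as the label for each sample. The following is only a minimal sketch of collecting folder-name labels while loading, assuming the files/<folder>/<file> layout from the question; the helper name load_labeled is hypothetical and is not part of scikit-learn.

```python
import json
import os


def load_labeled(root):
    """Load every JSON file under root/<folder>/ and label it with the folder name."""
    X, y = [], []
    for folder in sorted(os.listdir(root)):
        folder_path = os.path.join(root, folder)
        if not os.path.isdir(folder_path):
            continue  # skip stray files at the top level
        for fname in sorted(os.listdir(folder_path)):
            with open(os.path.join(folder_path, fname), "r") as f:
                X.append(json.load(f))
            y.append(folder)  # one string label per sample
    return X, y
```

With this, X, y = load_labeled("files") followed by train_test_split(X, y, test_size=0.2) should give y_train and y_test as flat lists of folder names, like the ones built by hand in the second snippet, rather than nested lists.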