Question

我正在使用datatable数据框。如何将数据框分为训练和测试数据集？
与pandas数据框类似，我尝试使用sklearn.model_selection中的train_test_split(dt_df,classes)，但是它不起作用，并且出现错误。

import datatable as dt
import numpy as np
from sklearn.model_selection import train_test_split

dt_df = dt.fread(csv_file_path)
classe = dt_df[:, "classe"])
del dt_df[:, "classe"])

X_train, X_test, y_train, y_test = train_test_split(dt_df, classe, test_size=test_size)

我收到以下错误：TypeError：列选择器必须是整数或字符串，而不是

我通过将数据帧转换为numpy数组来尝试解决方法：

classe = np.ravel(dt_df[:, "classe"])
dt_df = dt_df.to_numpy()

就像它可以工作一样，但是，我不知道是否有一种方法可以使train_test_split像在熊猫数据框中一样正常工作。

编辑1：。csv文件包含作为字符串的列，并且值是unsigned int。使用print(dt_df)，我们得到：

     | CCC  CCG  CCU  CCA  CGC  CGG  CGU  CGA  CUC  CUG  …  
---- + ---  ---  ---  ---  ---  ---  ---  ---  ---  ---     
   0 |   0    0    0    0    2    0    1    0    0    1  …  
   1 |   0    0    0    0    1    0    2    1    0    1  …  
   2 |   0    0    0    1    1    0    1    0    1    2  …  
   3 |   0    0    0    1    1    0    1    0    1    2  …  
   4 |   0    0    0    1    1    0    1    0    1    2  …  
   5 |   0    0    0    1    1    0    1    0    1    2  …  
   6 |   0    0    0    1    0    0    3    0    0    2  …  
   7 |   0    0    0    1    1    0    0    0    1    2  …  
   8 |   0    0    0    1    1    0    1    0    1    2  …  
   9 |   0    0    1    0    1    0    1    0    1    3  …  
  10 |   0    0    1    0    1    0    1    0    1    3  …  
      ...

感谢您的帮助。

Answer 1

我不知道可以分割dt的函数。但是你可以我们

dt_df = df.read_csv(csv_file_path)
classe = dt_df[:, "classe"])
del dt_df[:, "classe"])

X_train, X_test, y_train, y_test = train_test_split(dt_df, classe, test_size=test_size)

，然后通过以下方式将DataFame转换为DataTable：

X_train = dt.Frame(X_train)
X_test = dt.Frame(X_test)

Answer 2

我使用来自sklearn.model_selection的train_test_split(dt_df,classes)将数据表数据框拆分为训练和测试数据集的解决方案是将数据表数据框转换为我在问题帖中提到的numpy或注释为pandas数据框@Manoor Hassan撰写（往返）：

分割方法之前的源代码：

import datatable as dt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import ExtraTreesClassifier

dt_df = dt.fread(csv_file_path)

classe = np.ravel(dt_df[:, "classe"])
del dt_df[:, "classe"])

拆分方法后的源代码：

ExTrCl = ExtraTreesClassifier()
ExTrCl.fit(X_train, y_train)
pred_test = ExTrCl.predict(X_test)

方法1 ：转换为numpy

# source code before split method

dt_df = dt_df.to_numpy()

X_train, X_test, y_train, y_test = train_test_split(dt_df, classe, test_size=test_size)

# source code after split method

方法2 ：转换为numpy并在拆分后返回数据表数据帧：

# source code before split method

dt_df = dt_df.to_numpy()

X_train, X_test, y_train, y_test = train_test_split(dt_df, classe, test_size=test_size)

X_train = dt.Frame(X_train)

# source code after split method

方法3 ：转换为熊猫数据框

# source code before split method

dt_df = dt_df.to_pandas()

X_train, X_test, y_train, y_test = train_test_split(dt_df, classe, test_size=test_size)

# source code after split method

对于大约 500 Mo 的csv文件，这3种方法可以正常工作，但是火车的时间性能（ExTrCl.fit）和预测的时间性能（ExTrCl.predict）有所不同。我有这些结果：

                       T convert    T.train     T.pred
M1 to_numpy             3           85          0.5
M2 to_numpy and back    3.5         29          0.5
M3 to pandas            4           37          4

如何在Python中将数据表数据框拆分为训练和测试数据集

2 个答案: