Question

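I need to split a pandas DataFrame that I read from a CSV into three sets: training, test, and validation. My problem is that I don't know in advance how many attributes (columns) the CSV has, because I work with many datasets of different sizes (some have 3 or 4 attributes, others have 40+). I need to split the data in these proportions:

Training = 50%
Test = 25%
Validation = 25%

So if I have 5 attributes with 100 values each, I need 50 rows for training. How can I split the data across all attributes so that I end up with a new DataFrame for each set, always keeping the correct proportions? I have already implemented the code that reads the CSV. As you can see, it is generic: it only receives the path to the CSV and returns a new DataFrame.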

import pandas as pd


class Entity:

    def __init__(self, path):
        self.data_frame = pd.read_csv(path)

    def get_value(self, attr):
        return self.data_frame[attr]

    def split_set(self):
        pass

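This class is generic, and I need to write the split_set function to split the data. I have only just started using pandas and Python, so sorry if this is obviously easy to solve, but I can't think of a good way to do it. Thanks in advance.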

Answer 1

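Add a column R to the data. Assign it a hash of the row or a random number, so that its value lies between 0 and 1.

Then 0 <= R < .5 means a training row, .5 <= R < .75 means test, and .75 <= R < 1 means validation.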
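
A minimal sketch of the random-column idea, assuming a made-up toy DataFrame (the column names "a" and "b" and the seed are only for illustration). Because each row is assigned independently, the split sizes come out only approximately 50/25/25:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the same code works for any number of columns.
df = pd.DataFrame({"a": range(100), "b": range(100, 200)})

# One uniform draw in [0, 1) per row; the seed is an arbitrary choice
# that makes the split repeatable.
rng = np.random.default_rng(seed=0)
df["R"] = rng.random(len(df))

# Route each row by the range its R value falls into, then drop the helper column.
train = df[df["R"] < 0.5].drop(columns="R")
test = df[(df["R"] >= 0.5) & (df["R"] < 0.75)].drop(columns="R")
val = df[df["R"] >= 0.75].drop(columns="R")

print(len(train), len(test), len(val))
```

If you need exact proportions rather than approximate ones, shuffling and slicing by position is the safer choice.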

Answer 2

I think you can randomly reorder the DataFrame and take the first 50% as train, 50%-75% as test, and 75%-100% as validation.

df = df.sample(frac=1)  # randomly reorder the whole dataframe
n_rows = len(df)

train_idx = n_rows // 2
test_idx = train_idx + n_rows // 4

train = df.iloc[:train_idx, :]
test = df.iloc[train_idx: test_idx, :]
val = df.iloc[test_idx:, :]

Hope it helps!
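
As a sketch, the shuffle-and-slice approach can be wrapped in a helper that works for any number of columns. The name split_frame and the seed parameter are assumptions for illustration; the body could equally serve as the split_set method from the question, operating on self.data_frame:

```python
import pandas as pd


def split_frame(df, seed=None):
    """Shuffle df and return (train, test, val) as 50% / 25% / 25% slices."""
    # frac=1 returns every row in random order; random_state makes it repeatable.
    shuffled = df.sample(frac=1, random_state=seed)
    n_rows = len(shuffled)

    train_end = n_rows // 2
    test_end = train_end + n_rows // 4

    train = shuffled.iloc[:train_end]
    test = shuffled.iloc[train_end:test_end]
    val = shuffled.iloc[test_end:]  # whatever is left, so no row is lost
    return train, test, val


# Hypothetical toy data with 100 rows.
df = pd.DataFrame({"x": range(100), "y": range(100)})
train, test, val = split_frame(df, seed=0)
print(len(train), len(test), len(val))  # 50 25 25
```

When the row count is not divisible by 4, the validation slice absorbs the remainder.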

Answer 3

One method you can use is sklearn.model_selection.train_test_split from the sklearn library.

import numpy as np
from sklearn.model_selection import train_test_split

X= np.arange(10).reshape((5, 2))
X_train, X_test = train_test_split(X, test_size=0.33, random_state=42)

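You can then see that the data has been split into training and test sets. For more sets, you can repeat the step until you have the splits you need.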
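
For the 50/25/25 case in the question, "repeating the step" could look like the sketch below: split off half the rows for training, then split the remaining half evenly into test and validation. The toy DataFrame and the random_state value are assumptions for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame with 100 rows and 2 columns.
df = pd.DataFrame(np.arange(200).reshape((100, 2)), columns=["a", "b"])

# First pass: 50% for training, 50% held back.
train, rest = train_test_split(df, test_size=0.5, random_state=42)

# Second pass: split the held-back half evenly, giving 25% + 25% of the original.
test, val = train_test_split(rest, test_size=0.5, random_state=42)

print(len(train), len(test), len(val))  # 50 25 25
```

train_test_split accepts a DataFrame directly and returns DataFrames, so the column count does not matter.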

Answer 4

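You can use the sklearn library:

from sklearn.model_selection import train_test_split
# With train_size=0.5 and test_size=0.25, the remaining 25% of rows appear in neither split.
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, train_size=0.5)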

Getting training sets with pandas
