Question

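I have a pandas DataFrame of roughly 50K x 9.5K. The dataset is binary: it contains only 1s and 0s, and most of the entries are zeros.

Think of it as user-item purchase data, where a cell is 1 if the user purchased the item and 0 otherwise. Users are rows and items are columns.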

353 0   0   0   0   0   0   0   0   0   0   ... 0   0   0   0   0   0   0   0   0   0
354 0   0   0   0   0   0   0   0   0   0   ... 0   0   0   0   0   0   0   0   0   0
355 0   0   0   0   0   0   0   0   0   0   ... 0   0   0   0   0   0   0   0   0   0
356 0   0   0   0   0   0   0   0   0   0   ... 0   0   0   0   0   0   0   0   0   0
357 0   0   0   0   0   0   0   0   0   0   ... 0   0   0   0   0   0   0   0   0   0

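I want to split it into training, validation, and test sets. However, this is not going to be a normal split by rows.

What I want is that, for each of the validation and test sets, I hold out 2-4 columns from the original data that are non-zero.

So basically, if my original data has 9.5K columns for each user, I first keep only about 1500 columns. Then I split this sampled data into train and test, keeping 1495-1498 columns in train and 2-5 columns in test/validation. The columns that go into test are only ones that are non-zero. Training can have both zero and non-zero columns.

I also want to keep the item names/indices so that they correspond to the item names/indices held out in test/validation.

I don't want to run a loop that checks each cell value and copies it into another table.

Any ideas?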

Edit 1:

So this is what I am trying to achieve.

Answer 1

So, by non-zero, I guess you mean the columns that contain only ones. That is easy to do. The best approach is probably to use sum, like so:

import numpy

sums = df.sum(axis=0)  # sum down each column: a Series with column names as the index and column sums as values
non_zero_cols = sums[sums == len(df)].index  # names of columns whose every record is 1; use sums > 0 to keep columns with at least one 1

# Now split the data into training and testing
test_cols = numpy.random.choice(non_zero_cols, 2, replace=False)  # or 5; just randomly selecting columns
test_data = df[test_cols]
train_data = df.drop(columns=test_cols)  # drop the held-out columns (the default would drop rows)

Is this what you are looking for?
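To make the column-level split above concrete, here is a minimal sketch on a hypothetical toy frame. The data and item names are invented for illustration, and the sketch assumes that a "non-zero column" means a column with at least one 1 (hence `> 0`), which may or may not match what the asker intended.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 4 users (rows) x 5 items (columns), values are 0/1.
df = pd.DataFrame(
    [[0, 1, 0, 1, 0],
     [0, 0, 1, 1, 0],
     [0, 1, 1, 0, 0],
     [0, 0, 0, 1, 0]],
    columns=["item_0", "item_1", "item_2", "item_3", "item_4"],
)

# Column sums; item_0 and item_4 are all zeros and get filtered out.
col_sums = df.sum(axis=0)
non_zero_cols = col_sums[col_sums > 0].index  # assumption: "non-zero" = at least one 1

# Hold out 2 of the non-zero columns for testing; the rest stay in training.
rng = np.random.default_rng(0)
test_cols = list(rng.choice(non_zero_cols.to_numpy(), size=2, replace=False))
test_data = df[test_cols]
train_data = df.drop(columns=test_cols)

print(test_cols)
print(train_data.columns.tolist())
```

Because the held-out frame is selected by column label, the item names carry over to both `test_data` and `train_data` without any extra bookkeeping.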

Answer 2

IIUC:

threshold = 6
new_df = df.loc[df.sum(1) >= threshold]

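df.sum(1) sums each row. Since the values are 1s and 0s, this is equivalent to counting the 1s.

df.sum(1) >= threshold creates a Series of Trues and Falses, also known as a boolean mask.

df.loc happens to accept boolean masks as a way to slice.

df.loc[df.sum(1) >= threshold] passes the boolean mask to df.loc and returns only the rows that have a corresponding True in the mask.

Since the boolean mask is True only where the count of 1s is greater than or equal to threshold, this is equivalent to returning a slice of the dataframe in which every row has at least threshold non-zero values.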
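The mask logic can be checked on a small example. The following is a sketch on a hypothetical 4 x 4 frame (the user and item labels are made up), using a threshold of 3 so the effect is visible:

```python
import pandas as pd

# Hypothetical toy data: 4 users x 4 items.
df = pd.DataFrame(
    [[1, 1, 1, 0],
     [0, 0, 1, 0],
     [1, 1, 0, 1],
     [0, 0, 0, 0]],
    index=["u0", "u1", "u2", "u3"],
    columns=["a", "b", "c", "d"],
)

threshold = 3
row_counts = df.sum(axis=1)     # number of 1s per user: 3, 1, 3, 0
mask = row_counts >= threshold  # boolean Series aligned on the row index
new_df = df.loc[mask]           # keeps only the rows where the mask is True

print(mask.tolist())            # [True, False, True, False]
print(new_df.index.tolist())    # ['u0', 'u2']
```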

Then refer to this answer on how to split into test, train, and validation sets.

Or this answer.
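For the per-user hold-out described in the question, one possible loop-free approach is sketched below. It is not taken from either answer or from the linked ones: the function name `hold_out_per_user`, its parameters, and the toy data are all hypothetical. It assumes the frame holds only 0/1, has a unique index, and that every row has at least `k` ones (for example after the threshold filter above).

```python
import numpy as np
import pandas as pd

def hold_out_per_user(df, k=2, seed=0):
    """Move k randomly chosen 1-entries per row out of df into a held-out frame.

    Hypothetical helper. Assumes df is 0/1 only and each row has >= k ones.
    """
    rng = np.random.default_rng(seed)
    values = df.to_numpy()
    # Random score for every 1-cell; 0-cells get -1 so they are never picked.
    scores = np.where(values == 1, rng.random(values.shape), -1.0)
    # Column positions of the k highest-scoring cells in each row.
    picked = np.argpartition(scores, -k, axis=1)[:, -k:]
    held = np.zeros_like(values)
    np.put_along_axis(held, picked, 1, axis=1)
    # Same index and columns as df, so item names stay aligned.
    held_out = pd.DataFrame(held, index=df.index, columns=df.columns)
    train = df - held_out
    return train, held_out

# Hypothetical toy data: 3 users x 5 items, each user has at least 3 purchases.
df = pd.DataFrame(
    [[1, 1, 1, 0, 1],
     [0, 1, 1, 1, 0],
     [1, 0, 1, 1, 1]],
    index=["u0", "u1", "u2"],
    columns=list("abcde"),
)
train, held_out = hold_out_per_user(df, k=2)
print(held_out)
print(train)
```

Both returned frames keep the original row and column labels, so each held-out 1 can be traced back to its user and item. Calling the helper a second time on `train` would give a separate validation set, provided each row still has enough ones left.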
