Question

I load two DataFrames (testdf and datadf) from two files and then combine them with

df = pd.concat([testdf,datadf])

So far so good: df.shape is (48842, 15).

Now I need an 80% train, 10% test, 10% validation split.

trndf = df.sample(frac=0.8) returns the correct shape, (39074, 15).

tmpdf = df.drop(trndf.index) The idea here is to remove those 39074 rows from df, which should leave 9768 rows in total. However, the shape of tmpdf is (4514, 15), so 5254 rows are missing.

df uses the default index, numbered from 0 to 48841. A sample is below:

idx  age  work class
0    25   Private
1    28   Private

The trndf sample below is a random sample, and I have confirmed that its index numbers match the ones in the df DataFrame:

idx   age  work class
228   25   ?
2164  35   State-gov

I would like to understand how it manages to lose these extra rows. Thanks for any insight.

Answer 1

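By default pd.concat does not reset the index, so if the same index labels exist in both testdf and datadf, the combined DataFrame contains those labels twice. When one of them is sampled, every row carrying that label is dropped.

drop removes all rows whose index label matches, not just the sampled row, so for each label shared by testdf and datadf you lose more rows than you sampled.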
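One way to check whether this is what is happening is to inspect the index of the concatenated frame. The sketch below uses small made-up frames in place of the real testdf and datadf (their contents are an assumption for illustration only) and relies on `Index.is_unique` and `Index.duplicated`:

```python
import pandas as pd

# Hypothetical stand-ins for testdf and datadf; each has a default 0..n-1 index.
testdf = pd.DataFrame({"age": [25, 28, 35]})
datadf = pd.DataFrame({"age": [41, 19, 52, 33]})

df = pd.concat([testdf, datadf])  # index is now 0, 1, 2, 0, 1, 2, 3

# False here, because labels 0, 1 and 2 each appear twice.
has_unique_index = df.index.is_unique

# Number of rows whose label was already seen earlier in the index.
n_repeated = int(df.index.duplicated().sum())

# Dropping the single label 0 removes both rows that carry it.
remaining = df.drop([0])

print(has_unique_index, n_repeated, remaining.shape)
```

If `is_unique` is False on the real df, dropping by label will remove more rows than were sampled.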

A possible fix is to change df = pd.concat([testdf,datadf]) to

df = pd.concat([testdf,datadf]).reset_index(drop=True)

(drop=True keeps the old index from being added as an extra column), or

df = pd.concat([testdf,datadf], ignore_index=True)
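With a unique index in place, the 80/10/10 split from the question could look roughly like the sketch below. The two small frames and the `random_state` values are assumptions added so the example is reproducible; the names tstdf and valdf are hypothetical and do not come from the question.

```python
import pandas as pd

# Hypothetical stand-ins for the two loaded files (100 rows in total).
testdf = pd.DataFrame({"age": range(40)})
datadf = pd.DataFrame({"age": range(40, 100)})

# ignore_index=True gives the combined frame unique labels 0..99.
df = pd.concat([testdf, datadf], ignore_index=True)

trndf = df.sample(frac=0.8, random_state=0)      # 80% for training
tmpdf = df.drop(trndf.index)                     # the remaining 20%
tstdf = tmpdf.sample(frac=0.5, random_state=0)   # half of the rest: 10% test
valdf = tmpdf.drop(tstdf.index)                  # the other half: 10% validation

print(trndf.shape, tstdf.shape, valdf.shape)
```

Because every label is unique, each drop removes exactly the rows that were sampled, and the three parts add up to the full frame.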

Reproducing the problem:

import pandas as pd

df = pd.DataFrame({'a': {0: 0.6987303529918656,
  1: -1.4637804486869905,
  2: 0.4512092453413682,
  3: 0.03898323021771516,
  4: -0.143758037238284,
  5: -1.6277278110578157}})

# Concatenating without ignore_index keeps the labels 0..5 twice.
df_combined = pd.concat([df, df])
print(df_combined)
print(df_combined.shape)
# Sample half of the 12 rows, then drop them by index label.
sample = df_combined.sample(frac=0.5)
print(sample.shape)
df_combined.drop(sample.index).shape

          a
0  0.698730
1 -1.463780
2  0.451209
3  0.038983
4 -0.143758
5 -1.627728
0  0.698730
1 -1.463780
2  0.451209
3  0.038983
4 -0.143758
5 -1.627728
(12, 1) # print(df_combined.shape)
(6, 1)  # print(sample.shape)
Out[37]:
(4, 1)  # df_combined.drop(sample.index).shape
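The count in the last line depends on the random sample: each distinct label that is drawn removes both rows carrying it, so the number of rows left is 12 minus twice the number of distinct sampled labels. A minimal sketch of that arithmetic, assuming simplified values for column a and a fixed `random_state` (both are illustrative choices, not part of the original reproduction):

```python
import pandas as pd

df = pd.DataFrame({"a": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]})
df_combined = pd.concat([df, df])  # labels 0..5, each appearing twice

sample = df_combined.sample(frac=0.5, random_state=1)  # 6 of the 12 rows
n_distinct = sample.index.nunique()                    # distinct labels drawn
left = df_combined.drop(sample.index)

# Every distinct sampled label takes both of its rows with it.
expected_left = len(df_combined) - 2 * n_distinct
print(len(left), expected_left)

# With unique labels, dropping the sample leaves exactly the other 6 rows.
fixed = pd.concat([df, df], ignore_index=True)
fixed_left = fixed.drop(fixed.sample(frac=0.5, random_state=1).index)
print(fixed_left.shape)
```

The first two printed numbers always agree, whatever the seed, while the fixed version always keeps six rows.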

pandas 0.22: DataFrame.drop removes more rows than it should
