I need to randomly split a DataFrame into two disjoint sets by the attribute 'ids'. For example, consider the following DataFrame:
df=
Out[470]:
0 1 2 3 ids
0 17.0 18.0 16.0 15.0 13.0
1 18.0 16.0 15.0 15.0 13.0
2 16.0 15.0 15.0 16.0 13.0
131 12.0 8.0 21.0 19.0 14.0
132 8.0 21.0 19.0 20.0 14.0
133 21.0 19.0 20.0 9.0 14.0
248 NaN NaN 12.0 11.0 17.0
249 NaN 12.0 11.0 10.0 17.0
250 12.0 11.0 10.0 NaN 17.0
287 3.0 3.0 1.0 8.0 20.0
288 3.0 1.0 8.0 3.0 20.0
289 1.0 8.0 3.0 3.0 20.0
413 21.0 7.0 16.0 18.0 25.0
414 7.0 16.0 18.0 19.0 25.0
415 16.0 18.0 19.0 18.0 25.0
665 10.0 8.0 8.0 7.0 27.0
666 8.0 8.0 7.0 9.0 27.0
667 8.0 7.0 9.0 8.0 27.0
790 NaN NaN 15.0 NaN 33.0
791 NaN 15.0 NaN 10.0 33.0
792 15.0 NaN 10.0 NaN 33.0
812 NaN 16.0 NaN 17.0 34.0
813 16.0 NaN 17.0 NaN 34.0
814 NaN 17.0 NaN 13.0 34.0
944 3.0 4.0 3.0 18.0 35.0
945 4.0 3.0 18.0 18.0 35.0
946 3.0 18.0 18.0 11.0 35.0
1059 9.0 10.0 3.0 4.0 56.0
1060 10.0 3.0 4.0 3.0 56.0
1061 3.0 4.0 3.0 3.0 56.0
... ... ... ... ...
10125 NaN 9.0 5.0 5.0 101317.0
10126 9.0 5.0 5.0 5.0 101317.0
10127 5.0 5.0 5.0 7.0 101317.0
I need to get two DataFrames, randomly separated by some fraction size, such that no value of ids appears in both.
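I know how to solve this in a "non-pandasian" way:
get the unique values of ids
randomly split those ids into two sets
select rows by the value of ids with .isin()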
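The manual approach described above might look roughly like the following. This is only a sketch: the toy frame, the `frac` value, and the seed are made-up assumptions standing in for the real data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the question's df (values are made up).
df = pd.DataFrame({
    "x": range(12),
    "ids": [13, 13, 13, 14, 14, 14, 17, 17, 17, 20, 20, 20],
})

rng = np.random.default_rng(0)  # seed chosen only for reproducibility

# 1. Collect the unique ids.
unique_ids = df["ids"].unique()

# 2. Shuffle them and cut at the desired fraction.
frac = 0.5  # hypothetical fraction of ids that go to the first set
shuffled_ids = rng.permutation(unique_ids)
n_first = int(round(frac * len(shuffled_ids)))
first_ids = shuffled_ids[:n_first]

# 3. Select rows by membership with .isin().
mask = df["ids"].isin(first_ids)
df1, df2 = df[mask], df[~mask]
```

Note that `frac` here applies to the number of distinct ids, not to the number of rows, so the row counts of the two parts only match the fraction when every id has the same number of rows.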
I wonder whether there is a simple, neat way to do this with some pandas built-in function, such as .sample()?
Answer 0 (score: 5)
Use sklearn.model_selection.GroupShuffleSplit to perform the split:
from sklearn.model_selection import GroupShuffleSplit
# Initialize the GroupShuffleSplit.
gss = GroupShuffleSplit(n_splits=1, test_size=0.5)
# Get the indexers for the split.
idx1, idx2 = next(gss.split(df, groups=df.ids))
# Get the split DataFrames.
df1, df2 = df.iloc[idx1], df.iloc[idx2]
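To check that a split of this kind really keeps ids apart, something like the following could be run. It is a sketch on a made-up toy frame; `random_state=0` is added here only so the result is repeatable and is not part of the answer above.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy frame with a non-default index, loosely mimicking the question's df.
df = pd.DataFrame(
    {"x": range(12),
     "ids": [13, 13, 13, 14, 14, 14, 17, 17, 17, 20, 20, 20]},
    index=[0, 1, 2, 131, 132, 133, 248, 249, 250, 287, 288, 289],
)

gss = GroupShuffleSplit(n_splits=1, test_size=0.5, random_state=0)

# split() yields positional indices, which is why .iloc is used below.
idx1, idx2 = next(gss.split(df, groups=df["ids"]))
df1, df2 = df.iloc[idx1], df.iloc[idx2]

# ids present in both parts; expected to be empty for a group-wise split.
shared = set(df1["ids"]) & set(df2["ids"])
```

`test_size` is interpreted as a proportion of groups rather than of rows, so with unevenly sized groups the row counts of `df1` and `df2` can differ from the requested fraction.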
Answer 1 (score: 2)
UPDATE:
df1 = df.sample(frac=1).loc[df.ids % 2 == 0]
df2 = df.loc[df.index.difference(df1.index)]
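The parity test above keeps the ids disjoint, but which ids land in which part is fixed by their values and the split size cannot be tuned. Assuming a random assignment with a chosen fraction is what is wanted, one possible variant is to sample the unique ids with `.sample()` instead. The toy frame, `frac`, and `random_state` below are illustrative assumptions.

```python
import pandas as pd

# Toy frame standing in for the question's df (values are made up).
df = pd.DataFrame({
    "x": range(12),
    "ids": [13, 13, 13, 14, 14, 14, 17, 17, 17, 20, 20, 20],
})

frac = 0.5  # hypothetical fraction of distinct ids for the first part

# Randomly pick a fraction of the distinct ids.
picked_ids = df["ids"].drop_duplicates().sample(frac=frac, random_state=0)

# Rows whose id was picked go to df1, all others to df2.
in_first = df["ids"].isin(picked_ids)
df1 = df[in_first]
df2 = df[~in_first]
```

Because the mask is built from the ids themselves rather than from the index, this also works when the index contains duplicates.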
OLD, incorrect answer (it does not keep the ids separate):
You can first shuffle your DF with sample(frac=1) and then split it with np.split():
import numpy as np
df1, df2 = np.split(df.sample(frac=1), 2)
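A small sketch of why a row-level shuffle does not satisfy the requirement: rows that share an id end up in both halves. The toy frame is made up, and plain `.iloc` slicing is used here in place of `np.split()` purely to keep the illustration simple.

```python
import pandas as pd

# Three ids with four rows each (made-up data).
df = pd.DataFrame({
    "x": range(12),
    "ids": [13] * 4 + [14] * 4 + [17] * 4,
})

# Shuffle the rows, then cut the shuffled frame in the middle.
shuffled = df.sample(frac=1, random_state=0)
half = len(shuffled) // 2
part1, part2 = shuffled.iloc[:half], shuffled.iloc[half:]

# ids that appear in both halves.
shared = set(part1["ids"]) & set(part2["ids"])
print(sorted(shared))
```

With groups of four rows and halves of six rows, at least one id must straddle the cut, so `shared` cannot be empty in this example.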