I have the following pandas DataFrame.
import pandas as pd

columns = ['question_id', 'answer', 'is_correct']
data = [['1','hello','1.0'],
['1','hello', '1.0'],
['1','hello', '1.0'],
['2', 'dog', '0.0'],
['2', 'cat', '1.0'],
['2', 'dog', '0.0'],
['2', 'the answer is cat', '1.0'],
['3', 'Milan', '1.0'],
['3', 'Paris', '0.0'],
['3', 'The capital is Paris', '0.0'],
['3', 'MILAN', '1.0'],
['4', 'The capital is Paris', '1.0'],
['4', 'London', '0.0'],
['4', 'Paris', '1.0'],
['4', 'paris', '1.0'],
['5', 'lol', '0.0'],
['5', 'rofl', '0.0'],
['6', '5.5', '1.0'],
['6', '5.2', '0.0']]
df = pd.DataFrame(columns=columns, data=data)
df
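I want to split it into two DataFrames based on question_id. That is, I want 80% of the unique question_ids to end up in df1 and the remaining 20% in df2, with the count rounded to the nearest integer.
Dummy example for the df above: df1 contains ids 1-5 and df2 contains id 6.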
df1_data = [['1','hello','1.0'],
['1','hello', '1.0'],
['1','hello', '1.0'],
['2', 'dog', '0.0'],
['2', 'cat', '1.0'],
['2', 'dog', '0.0'],
['2', 'the answer is cat', '1.0'],
['3', 'Milan', '1.0'],
['3', 'Paris', '0.0'],
['3', 'The capital is Paris', '0.0'],
['3', 'MILAN', '1.0'],
['4', 'The capital is Paris', '1.0'],
['4', 'London', '0.0'],
['4', 'Paris', '1.0'],
['4', 'paris', '1.0'],
['5', 'lol', '0.0'],
['5', 'rofl', '0.0']]
df2_data = [['6', '5.5', '1.0'],
['6', '5.2', '0.0']]
Answer 0 (score: 1)
First, get the unique question ids:
unique_qid = df['question_id'].unique()
array(['1', '2', '3', '4', '5', '6'], dtype=object)
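Then take the first 80% of the unique question ids (round(0.8 * 6) = 5 ids here) and use the resulting boolean mask to build the two output DataFrames. All rows that share a question_id stay together in the same output.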
df1_idx = df['question_id'].isin(unique_qid[:round(0.8 * len(unique_qid))])
df1_data = df.loc[df1_idx, :]
df2_data = df.loc[~df1_idx, :]
df1_data
df2_data
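The same idea can be wrapped in a small helper. The sketch below is only an illustration: the function name `split_by_group` and its `frac` and `seed` parameters are hypothetical and not part of the original answer. It assumes you may want a random rather than positional choice of ids, so it optionally shuffles the unique ids with NumPy before slicing. Note that Python's built-in `round` rounds exact halves to the nearest even integer, which can matter for fractions other than 0.8.

```python
import numpy as np
import pandas as pd


def split_by_group(df, key, frac=0.8, seed=None):
    """Split df so that about `frac` of the unique `key` values go to the first part.

    Hypothetical helper: all rows sharing a `key` value stay in the same part.
    """
    unique_ids = np.asarray(df[key].unique())
    if seed is not None:
        # Shuffle a copy of the ids so the split is random but reproducible.
        rng = np.random.default_rng(seed)
        unique_ids = rng.permutation(unique_ids)
    # Number of unique ids assigned to the first part.
    n_first = round(frac * len(unique_ids))
    in_first = df[key].isin(unique_ids[:n_first])
    return df[in_first], df[~in_first]


# Small demo frame with six unique question ids (made up for illustration).
demo = pd.DataFrame({
    "question_id": ["1", "1", "2", "3", "3", "4", "5", "6", "6"],
    "answer": list("abcdefghi"),
})
first, second = split_by_group(demo, "question_id", frac=0.8, seed=0)
```

With `seed=None` the helper keeps the ids in order of first appearance, which matches the positional slice used in the answer above; passing a seed picks a random set of ids instead.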