Question

Suppose I have two pandas DataFrames: df, with dimensions 297232 x 122, and df_raw, with dimensions 840380 x 122. df is already a subset of df_raw. Both DataFrames are indexed by DateTime. I would like to sample 70% of the rows of df and 30% of the rows of df_raw (sampled at random as needed), while ensuring that the two sampled subsets do not overlap in terms of their index.

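More precisely, df_subset would contain 70% of the rows of df, chosen at random, and df_raw_subset would contain 30% of the rows of df_raw, chosen at random, but df_subset and df_raw_subset should not overlap in the rows sampled, i.e. they should have unique DateTime indices.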

Answer 1

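We start by sampling from df first, since it is the smaller frame. That way, when we later drop those rows from the larger df_raw, we will not run into the problem of not having enough data points left to sample.

df_sub = df.sample(frac=0.7, replace=False)

Then we drop the index labels in df_sub from df_raw and sample from what remains.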
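The code for the second step did not survive on this page, so the following is only a minimal sketch of the two steps described in the answer, not the answerer's exact code. The toy frames, the column name `x`, the `random_state` values and the names `n_raw`, `df_raw_sub` and `overlap` are assumptions made for illustration. It also assumes that "30%" means 30% of the full df_raw, which is why an absolute row count is passed to `sample` after the drop.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the question's frames (far smaller than the real ones).
rng = np.random.default_rng(0)
idx = pd.date_range("2020-01-01", periods=1000, freq="min")
df_raw = pd.DataFrame({"x": rng.normal(size=len(idx))}, index=idx)
df = df_raw.iloc[::2]  # df is a subset of df_raw, as in the question

# Step 1: sample 70% of the smaller frame without replacement.
df_sub = df.sample(frac=0.7, replace=False, random_state=0)

# Step 2: drop those index labels from df_raw, then sample from the remainder.
# Assumption: the target is 30% of the full df_raw, so a row count is used.
n_raw = int(len(df_raw) * 0.3)
df_raw_sub = df_raw.drop(df_sub.index).sample(n=n_raw, replace=False, random_state=0)

# The two samples should share no index labels.
overlap = df_sub.index.intersection(df_raw_sub.index)
print(len(df_sub), len(df_raw_sub), len(overlap))
```

If the DateTime index contains duplicate labels, `drop` removes every row carrying a dropped label, so fewer rows may remain in df_raw than expected.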

Extract subsets from a pandas DataFrame, ensuring no overlap?