Question

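I have a list of strings and two separate pandas dataframes. One of the dataframes contains NaNs. I am trying to find a fast way to check whether any item in the list is contained in either dataframe and, if so, remove it from the list.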

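Currently, I do this with a list comprehension. I first concatenate the two dataframes. Then I iterate over the list and use an if condition to check whether each item is contained in the values of the concatenated dataframe.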

patches = [patch for patch in patches if not patch in bad_patches.values]

The first few elements of my list of strings:

patches[1:5]

['S2A_MSIL2A_20170613T101031_11_52',
 'S2A_MSIL2A_20170717T113321_35_89',
 'S2A_MSIL2A_20170613T101031_12_39',
 'S2A_MSIL2A_20170613T101031_11_77']

An example of one of my dataframes; the second looks the same but has fewer rows. Note that the first row contains patches[2].

cloud_patches.head()

0  S2A_MSIL2A_20170717T113321_35_89

1  S2A_MSIL2A_20170717T113321_39_84

2   S2B_MSIL2A_20171112T114339_0_13

3   S2B_MSIL2A_20171112T114339_0_52

4   S2B_MSIL2A_20171112T114339_0_53

The concatenated dataframe:

bad_patches = pd.concat([cloud_patches, snow_patches], axis=1)
bad_patches.head()

0  S2A_MSIL2A_20170717T113321_35_89  S2B_MSIL2A_20170831T095029_27_76

1  S2A_MSIL2A_20170717T113321_39_84  S2B_MSIL2A_20170831T095029_27_85

2   S2B_MSIL2A_20171112T114339_0_13  S2B_MSIL2A_20170831T095029_29_75

3   S2B_MSIL2A_20171112T114339_0_52  S2B_MSIL2A_20170831T095029_30_75

4   S2B_MSIL2A_20171112T114339_0_53  S2B_MSIL2A_20170831T095029_30_78

And the tail, showing the NaNs in one column:

bad_patches.tail()

61702  NaN   S2A_MSIL2A_20180228T101021_43_6

61703  NaN   S2A_MSIL2A_20180228T101021_43_8

61704  NaN  S2A_MSIL2A_20180228T101021_43_11

61705  NaN  S2A_MSIL2A_20180228T101021_43_13

61706  NaN  S2A_MSIL2A_20180228T101021_43_16

(Almost) all of the column headers are named 0.

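The second element of patches should be removed, because it is contained in the first row of bad_patches. My approach does work, but it takes a very long time. bad_patches has 60,000 rows, and the length of patches varies. Right now, for a patches list of length 1000, it takes 2.04 seconds, but I need to scale to 500k patches, so I am hoping there is a faster way. Thanks!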

Answer 1

I would create a set from the values in cloud_patches and snow_patches. Then also turn patches into a set:

patch_set = set(cloud_patches[0]).union(set(snow_patches[0]))
patches = set(patches)

Now you simply subtract all the values in patch_set from patches, and you are left with only the values in patches that appear in neither cloud_patches nor snow_patches:

cleaned_list = list(patches - patch_set)
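
Note that converting patches to a set discards the original order and any duplicates. If those matter, one possible variant is to keep patches as a list and only turn the dataframe values into a set, so each membership test is a hash lookup rather than a scan over the whole array. The following is a minimal sketch under that assumption, not the answerer's code: the helper name build_bad_set and the tiny sample frames are made up for illustration, with values borrowed from the question.

```python
import numpy as np
import pandas as pd

# Small stand-ins for the real data (sample values borrowed from the question).
patches = [
    "S2A_MSIL2A_20170613T101031_11_52",
    "S2A_MSIL2A_20170717T113321_35_89",
    "S2A_MSIL2A_20170613T101031_12_39",
    "S2A_MSIL2A_20170613T101031_11_77",
]
cloud_patches = pd.DataFrame(
    {0: ["S2A_MSIL2A_20170717T113321_35_89", "S2A_MSIL2A_20170717T113321_39_84"]}
)
snow_patches = pd.DataFrame({0: ["S2B_MSIL2A_20170831T095029_27_76", np.nan]})


def build_bad_set(*frames):
    """Collect every non-NaN value from the given dataframes into one set.

    Hypothetical helper for illustration; it flattens each frame so it also
    works when a frame has several columns or repeated column labels.
    """
    bad = set()
    for frame in frames:
        values = frame.to_numpy().ravel()
        bad.update(v for v in values if pd.notna(v))
    return bad


bad_set = build_bad_set(cloud_patches, snow_patches)

# Order-preserving filter: each "in" check against a set is O(1) on average.
cleaned_list = [patch for patch in patches if patch not in bad_set]
print(cleaned_list)
```

With this layout the set is built once, so the total cost should grow roughly with len(patches) plus the number of dataframe values, rather than with their product as in the original list comprehension.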

Quickly remove list elements if they are contained in a pandas dataframe
