Question

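I have a pandas dataframe named df_A, which in practice has more than 100 columns.

I also have another dataframe, df_b, in which two columns tell me which columns I need from df_A.

A reproducible example is given below: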

import pandas as pd

d = {'foo':[100, 111, 222], 
     'bar':[333, 444, 555],'foo2':[110, 101, 222], 
     'bar2':[333, 444, 555],'foo3':[100, 111, 222], 
     'bar3':[333, 444, 555]}

df_A = pd.DataFrame(d)

d = {'ReqCol_A':['foo','foo2'], 
     'bar':[333, 444],'foo2':[100, 111], 
     'bar2':[333, 444],'ReqCol_B':['bar3', ''], 
     'bar3':[333, 444]}

df_b = pd.DataFrame(d)

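As you can see in the example above, the values under ReqCol_A and ReqCol_B in df_b are the names of the columns I am trying to fetch from df_A.

So my expected output would contain three columns from df_A: foo, foo2 and bar3.

df_C would be the expected output, and it would look like this: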

df_C
foo foo2 bar3
100 110  333
111 101  444
222 222  555

Please help me. I am struggling to get this working.

Answer 1

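Try using filter to keep only the columns whose name contains 'ReqCol', then use stack to get a list of names and use that list to select columns from the df_A dataframe:

import numpy as np
df_A[df_b.filter(like='ReqCol').replace('', np.nan).stack().tolist()]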

Output:

   foo  bar3  foo2
0  100   333   110
1  111   444   101
2  222   555   222
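To make the one-liner easier to follow, here is a minimal sketch that splits it into steps on the question's data. df_b is trimmed to its two ReqCol columns for brevity, and the explicit dropna() is an added safeguard (not part of the answer above) for pandas versions in which stack() keeps NaN entries.

```python
import numpy as np
import pandas as pd

df_A = pd.DataFrame({'foo': [100, 111, 222], 'bar': [333, 444, 555],
                     'foo2': [110, 101, 222], 'bar2': [333, 444, 555],
                     'foo3': [100, 111, 222], 'bar3': [333, 444, 555]})
# Only the two requirement columns of df_b matter here.
df_b = pd.DataFrame({'ReqCol_A': ['foo', 'foo2'], 'ReqCol_B': ['bar3', '']})

# Step 1: keep only the columns whose name contains 'ReqCol'.
req = df_b.filter(like='ReqCol')

# Step 2: turn empty strings into NaN so they can be discarded.
req = req.replace('', np.nan)

# Step 3: stack everything into a single Series of names and drop NaN.
# dropna() is an assumption-level safeguard, harmless where stack() already drops NaN.
wanted = req.stack().dropna().tolist()

# Step 4: select those columns from df_A.
result = df_A[wanted]
print(result)
```

The columns come out in the order in which stack() visits the cells of df_b, which is not necessarily the order shown in the question's df_C.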

Answer 2

Solution:

# Collect all unique values from the df_b columns ReqCol_A and ReqCol_B;
# this may still include '' or NaN and other unwanted entries
features = set(df_b.ReqCol_A.unique()) | set(df_b.ReqCol_B.unique())

# Intersect with the df_A column names to keep only the names that actually exist
target_features = set(df_A.columns) & features

# Get the output (convert to a list: recent pandas versions do not accept a set as an indexer)
df_A.loc[:, list(target_features)]
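Because a set has no defined order, the columns selected this way may not come out as foo, foo2, bar3. If the order of the expected df_C matters, one possible variant is sketched below. It is an assumption about what the asker wants (all of ReqCol_A first, then ReqCol_B), not part of the answer above, and df_b is trimmed to its two ReqCol columns.

```python
import pandas as pd

df_A = pd.DataFrame({'foo': [100, 111, 222], 'bar': [333, 444, 555],
                     'foo2': [110, 101, 222], 'bar2': [333, 444, 555],
                     'foo3': [100, 111, 222], 'bar3': [333, 444, 555]})
df_b = pd.DataFrame({'ReqCol_A': ['foo', 'foo2'], 'ReqCol_B': ['bar3', '']})

# Read the requirement columns one after another: ReqCol_A, then ReqCol_B.
names = pd.concat([df_b['ReqCol_A'], df_b['ReqCol_B']], ignore_index=True)

# Remove repeated names, keeping the first occurrence of each.
names = names.drop_duplicates()

# Keep only names that are real df_A columns; this discards '' and other stray values.
wanted = [name for name in names if name in df_A.columns]

df_C = df_A[wanted]
print(df_C)
```

The list comprehension plays the role of the set intersection, but it preserves the order in which the names were read.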

Performance comparison

This approach:

%%timeit
features = set(df_b.ReqCol_A.unique()) | set(df_b.ReqCol_B.unique())
target_features = set(df_A.columns) & features
df_A.loc[:,target_features]
875 µs ± 22.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

The other answer (using filter):

%%timeit 
df_A[df_b.filter(like='ReqCol').replace('', np.nan).stack().tolist()]
2.14 ms ± 51.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

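In this measurement the set-based approach is clearly faster than the filter-based one: about 875 µs versus 2.14 ms per loop, roughly 2.4 times faster on this small example.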
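The %%timeit magic only works in IPython or Jupyter. As a rough sketch, the same comparison can be reproduced in a plain script with the standard timeit module. The function names set_based and filter_based and the run count are hypothetical choices, both functions include the small robustness tweaks (list(...) and dropna()) that the timed snippets above do not have, and the numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

df_A = pd.DataFrame({'foo': [100, 111, 222], 'bar': [333, 444, 555],
                     'foo2': [110, 101, 222], 'bar2': [333, 444, 555],
                     'foo3': [100, 111, 222], 'bar3': [333, 444, 555]})
df_b = pd.DataFrame({'ReqCol_A': ['foo', 'foo2'], 'ReqCol_B': ['bar3', '']})

def set_based():
    # Set intersection of the requested names with the df_A column names.
    features = set(df_b['ReqCol_A'].unique()) | set(df_b['ReqCol_B'].unique())
    target_features = set(df_A.columns) & features
    return df_A.loc[:, list(target_features)]

def filter_based():
    # filter + replace + stack, with NaN dropped explicitly.
    wanted = (df_b.filter(like='ReqCol').replace('', np.nan)
              .stack().dropna().tolist())
    return df_A[wanted]

runs = 200  # arbitrary; raise it for steadier numbers
t_set = timeit.timeit(set_based, number=runs) / runs
t_filter = timeit.timeit(filter_based, number=runs) / runs
print(f"set-based:    {t_set * 1e6:.0f} µs per call")
print(f"filter-based: {t_filter * 1e6:.0f} µs per call")
```

With only three rows and six columns, both timings are dominated by fixed pandas overhead, so it is worth repeating the measurement on data of realistic size before drawing conclusions.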
