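I am looking for ways to replace the functions I use in Excel with Python, Pandas in particular. One of them is COUNTIFS(), which I mainly use to locate specific row values within a fixed range, mostly to determine whether a given value in one column exists in another column.
An example in Excel looks like this:
The formula in the first row (column: col1_in_col2):
=COUNTIFS($B$2:$B$6,A2)
I tried to recreate the function in Pandas. The only difference is that the two columns live in two different DataFrames, and the DataFrames are stored in a dictionary (bigdict). The code is as follows: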
import pandas as pd
bigdict = {"df1": pd.DataFrame({"col1": ["0110200_2016", "011037_2016", "011037_2016", "0111054_2016"]}), "df2": pd.DataFrame({"col1" : ["011037_2016", "0111054_2016", "011109_2016", "0111268_2016"]})}
bigdict.get("df1")["df1_in_df2"] = bigdict.get("df1").apply(lambda x: 1 if x["col1"] in bigdict.get("df2")["col1"] else 0, axis=1)
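In the example above, the first row should return 0 and the other rows should return 1, because their values can be found in the other DataFrame's column. However, every row returns 0.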
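A likely reason every row returns 0 is that the `in` operator on a pandas Series tests membership in the index labels, not in the values. The following is a minimal sketch of that behavior; the Series `s` is made up for illustration and is not part of the original question.

```python
import pandas as pd

# Hypothetical Series for illustration; it gets the default index 0, 1, 2.
s = pd.Series(["011037_2016", "0111054_2016", "011109_2016"])

in_index = 0 in s                      # 0 is an index label
in_series = "011037_2016" in s         # the string is a value, not an index label
in_values = "011037_2016" in s.values  # membership tested against the values

print(in_index, in_series, in_values)
```

Under this reading, `x["col1"] in bigdict.get("df2")["col1"]` compares each string against the integer index of df2, which would explain the all-zero column.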
Answer 0 (score: 3)
Try this. I split your dictionary into two DataFrames and compared their values.
import numpy as np
import pandas as pd
df1 = pd.DataFrame({"col1": ["0110200_2016", "011037_2016", "011037_2016", "0111054_2016"]})
df2 = pd.DataFrame({"col1" : ["011037_2016", "0111054_2016", "011109_2016", "0111268_2016"]})
# 1 where the value in df1's first column also appears in df2's first column, else 0
df1['df1_in_df2'] = np.where(df1.iloc[:,0].isin(list(df2.iloc[:,0])),1,0)
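Splitting the dictionary is not strictly necessary. As a sketch, assuming the same `bigdict` layout as in the question, `isin` can be applied directly to the DataFrames stored in the dictionary:

```python
import pandas as pd

# Same layout as the question: two DataFrames held in a dictionary.
bigdict = {
    "df1": pd.DataFrame({"col1": ["0110200_2016", "011037_2016",
                                  "011037_2016", "0111054_2016"]}),
    "df2": pd.DataFrame({"col1": ["011037_2016", "0111054_2016",
                                  "011109_2016", "0111268_2016"]}),
}

# isin compares against the values of df2's column, not its index.
bigdict["df1"]["df1_in_df2"] = (
    bigdict["df1"]["col1"].isin(bigdict["df2"]["col1"]).astype(int)
)

print(bigdict["df1"])
```

Because `bigdict["df1"]` returns the stored DataFrame object itself, the new column is added in place.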
Answer 1 (score: 1)
Here is an approach using a list comprehension:
bigdict['df1']['df1_in_df2'] = [1 if x in bigdict['df2']['col1'].values else 0
                                for x in bigdict['df1']['col1']]
Output:
col1 df1_in_df2
0 0110200_2016 0
1 011037_2016 1
2 011037_2016 1
3 0111054_2016 1
Answer 2 (score: 1)
This is essentially the same as @Ashwini's answer, but it drops np.where and iloc, which makes it more readable and, ultimately, faster.
import pandas as pd
df1 = pd.DataFrame({"col1": ["0110200_2016", "011037_2016",
"011037_2016", "0111054_2016"]})
df2 = pd.DataFrame({"col1" : ["011037_2016", "0111054_2016",
"011109_2016", "0111268_2016"]})
df1['df1_in_df2'] = df1["col1"].isin(df2['col1'].values).astype("int8")
Update
Here I compare the four approaches by @vlemaistre, @Ashwini, @SamLegesse, and me.
import pandas as pd
import numpy as np
# create fake data
n = int(1e6)
n1 = int(1e4)
df = pd.DataFrame()
df["col1"] = ["{:012}".format(i) for i in range(n)]
df2 = df.sample(n1)
toRemove = df2.sample(n1//2).index
df1 = df[~df.index.isin(toRemove)].sample(frac=1).reset_index(drop=True)
df2 = df2.reset_index(drop=True)
# backup dataframe
df0 = df1.copy()
bigdict = {"df1": df1, "df2": df2}
%%time
bigdict['df1']['df1_in_df2'] = [1 if x in bigdict['df2']['col1'].values else 0
for x in bigdict['df1']['col1']]
CPU times: user 4min 53s, sys: 3.08 s, total: 4min 56s
Wall time: 4min 41s
def countif(x,col):
    # return 1 if x appears among the values of col, else 0
    if x in col.values:
        return 1
    else:
        return 0
df1 = df0.copy()
%%time
df1['df1_in_df2'] = df1['col1'].apply(countif, col=df2['col1'])
CPU times: user 4min 48s, sys: 2.66 s, total: 4min 50s
Wall time: 4min 38s
df1 = df0.copy()
%%time
df1['df1_in_df2'] = np.where(df1.iloc[:,0].isin(list(df2.iloc[:,0])),1,0)
CPU times: user 167 ms, sys: 0 ns, total: 167 ms
Wall time: 165 ms
This is exactly the same as Ashwini's solution.
df1 = df0.copy()
%%time
df1['df1_in_df2'] = df1["col1"].isin(df2['col1'].values).astype("int8")
CPU times: user 152 ms, sys: 0 ns, total: 152 ms
Wall time: 150 ms
The vectorized approaches are at least 1684 times faster than the one using apply.
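One plausible explanation for the gap is that `x in col.values` scans the whole array for every row, so the loop does work proportional to len(df1) * len(df2). As a sketch that was not part of the benchmark above (no timing is claimed here), building a Python `set` once makes each lookup a hash lookup, even inside a list comprehension:

```python
import pandas as pd

df1 = pd.DataFrame({"col1": ["0110200_2016", "011037_2016",
                             "011037_2016", "0111054_2016"]})
df2 = pd.DataFrame({"col1": ["011037_2016", "0111054_2016",
                             "011109_2016", "0111268_2016"]})

# Build the lookup structure once, outside the loop.
lookup = set(df2["col1"])

# Each membership test is now a hash lookup instead of a linear scan.
df1["df1_in_df2"] = [1 if x in lookup else 0 for x in df1["col1"]]

print(df1)
```

The `isin` versions remain the more idiomatic choice; this variant only illustrates where the time in the loop-based approaches is likely spent.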
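Answer 3 (score: 0)
In my opinion, the simplest approach is to write a general-purpose function that you can apply whenever you want the equivalent of Excel's countif().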
import pandas as pd
def countif(x,col):
    # return 1 if x appears among the values of col, else 0
    if x in col.values:
        return 1
    else:
        return 0
df1 = pd.DataFrame({"col1": ["0110200_2016", "011037_2016", "011037_2016", "0111054_2016"]})
df2 = pd.DataFrame({"col1" : ["011037_2016", "0111054_2016", "011109_2016", "0111268_2016"]})
df1['df1_in_df2'] = df1['col1'].apply(countif, col=df2['col1'])
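Edit:
As rpanai mentioned in the comments, apply has performance problems as the data grows. Vectorizing with numpy improves performance considerably. Here is a modified version of Ashwini's answer.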
import pandas as pd
import numpy as np
def countif(df1, df2, col1, col2, name):
    # adds column `name` to df1 in place: 1 if df1[col1] is found in df2[col2], else 0
    df1[name] = np.where(df1[col1].isin(list(df2[col2])),1,0)
df1 = pd.DataFrame({"col1": ["0110200_2016", "011037_2016", "011037_2016", "0111054_2016"]})
df2 = pd.DataFrame({"col1" : ["011037_2016", "0111054_2016", "011109_2016", "0111268_2016"]})
countif(df1,df2,'col1','col1','df1_in_df2')
print(df1)
# col1 df1_in_df2
# 0 0110200_2016 0
# 1 011037_2016 1
# 2 011037_2016 1
# 3 0111054_2016 1
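Note that the answers above return a 0/1 flag, whereas Excel's COUNTIFS returns the number of matches. If the actual count is needed, one possible sketch combines `value_counts` with `map`. The `df2` below is a hypothetical variant containing a duplicate so that the difference is visible, and the column name `count_in_df2` is likewise made up.

```python
import pandas as pd

df1 = pd.DataFrame({"col1": ["0110200_2016", "011037_2016",
                             "011037_2016", "0111054_2016"]})
# Hypothetical df2 with one duplicated value, for illustration only.
df2 = pd.DataFrame({"col1": ["011037_2016", "011037_2016",
                             "0111054_2016", "011109_2016"]})

# Count how often each value occurs in df2, indexed by the value itself.
counts = df2["col1"].value_counts()

# Look up each df1 value in those counts; values absent from df2 become 0.
df1["count_in_df2"] = df1["col1"].map(counts).fillna(0).astype(int)

print(df1)
```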