A faster way to run COUNTIFS in Python

Time: 2016-11-03 16:41:36

Tags: python sql pandas correlated-subquery

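I previously asked how to perform a COUNTIFS across multiple data frames in Python, the way you can run COUNTIFS against separate worksheets in Excel. Someone gave me a very creative answer:

python pandas countifs using multiple criteria AND multiple data frames

Thank you @AlexG. I tried it and it works very well: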

import pandas as pd
import numpy as np
import matplotlib as plt

#import the data
students = pd.read_csv("Student Detail stump.csv")
exams = pd.read_csv("Exam Detail stump.csv")

#get data parameters
student_info = students[['Student Number', 'Enrollment Date', 'Detail Date']].values

#prepare an empty list to hold the results
N_exams_passed = []

#count records in data set according to parameters
for s_id, s_enroll, s_qual in student_info:
    N_exams_passed.append(len(exams[(exams['Student Number']==s_id) &
                                    (exams['Exam Grade Date']>=s_enroll) &
                                    (exams['Exam Grade Date']<=s_qual) &
                                    (exams['Exam Grade']>=70)])
                          )

#add the results to the original data set
students['Exams Passed'] = N_exams_passed

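However, it only works for small data sets. When I ran it on 100,000 rows, it did not finish even overnight. It also doesn't look Pythonic.

The SQL approach, which does this in a few seconds, is a correlated subquery like this: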

SELECT
    s.*,
    (SELECT COUNT(e.[Exam Grade])
     FROM exams AS e
     WHERE e.[Exam Grade] >= 65
       AND e.[Student Number] = s.[Student Number]
       AND e.[Exam Grade Date] >= s.[Enrollment Date]
       AND e.[Exam Grade Date] <= s.[Detail Date]) AS ExamsPassed
FROM
    students AS s;

How can I reproduce a correlated subquery like this in pandas, or in some other Pythonic way?

Here are the data frames:

 #Students
 Student Number Enroll Date Detail Date
 1              1/1/2016    2/1/2016
 1              1/1/2016    3/1/2016
 2              2/1/2016    3/1/2016
 3              3/1/2016    4/1/2016

 #Exams
 Student Number Exam Date   Exam Grade
 1              1/1/2016    50
 1              1/15/2016   80
 1              1/28/2016   90
 1              2/5/2016    100
 1              3/5/2016    80
 1              4/5/2016    40
 2              2/2/2016    85
 2              2/3/2016    10
 2              2/4/2016    100

The final data frame should look like this, with a "Passed Exams" count added as the last column:

 #FinalResult
 Student Number Enroll Date Detail Date Passed Exams
 1              1/1/2016    2/1/2016    2
 1              1/1/2016    3/1/2016    3
 2              2/1/2016    3/1/2016    2
 3              3/1/2016    4/1/2016    0

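1 Answer:

Answer 0: (score: 0)

If I understand the structure of your data frames correctly, I suggest merging the two data frames and then using numpy.where on the merged data.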

import numpy as np

exams = exams.merge(students, on='Student Number', how='left')
exams['Passed'] = np.where(
    (exams['Exam Grade Date'] >= exams['Enrollment Date']) &
    (exams['Exam Grade Date'] <= exams['Detail Date']) &
    (exams['Exam Grade'] >= 70),
    1, 0)

students = students.merge(
    exams.groupby(['Student Number', 'Detail Date'])['Passed'].sum().reset_index(),
    left_on=['Student Number', 'Detail Date'],
    right_on=['Student Number', 'Detail Date'],
    how='left')
students['Passed'] = students['Passed'].fillna(0).astype('int')

Note: you need to make sure the date columns are stored as datetimes (you can do this with pandas.to_datetime).
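As a minimal sketch of that conversion: the rows below are copied from the question's sample data, and the column name 'Exam Grade Date' is taken from the question's code, so it is an assumption about the real CSV headers. The sample dates are written month/day/year, so passing an explicit format avoids any day/month ambiguity.

```python
import pandas as pd

# A few sample rows from the question; the dates start out as plain strings.
exams = pd.DataFrame({
    "Student Number": [1, 1, 2],
    "Exam Grade Date": ["1/15/2016", "2/5/2016", "2/2/2016"],
    "Exam Grade": [80, 100, 85],
})

# Parse as month/day/year so "2/5/2016" becomes February 5, not May 2.
exams["Exam Grade Date"] = pd.to_datetime(
    exams["Exam Grade Date"], format="%m/%d/%Y"
)

print(exams.dtypes)
```

The same call would apply to 'Enrollment Date' and 'Detail Date' in the students frame, so that the >= and <= comparisons compare dates rather than strings.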

numpy.where creates a new array that takes one value (1 in the code above) where the conditions you specify are met and another value (0) where they are not.
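A small illustration of that behaviour, using made-up grades (the `grades` series is hypothetical, not part of the question's data):

```python
import numpy as np
import pandas as pd

# Hypothetical grades, only to show how the condition maps to 1 or 0.
grades = pd.Series([50, 80, 90, 40])

# 1 where the grade is at least 70, otherwise 0.
passed = np.where(grades >= 70, 1, 0)

print(passed.tolist())
```

Because the flags are 1 and 0, summing them per group gives the number of rows that met the condition.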

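exams.groupby(['Student Number', 'Detail Date'])['Passed'].sum() produces a Series indexed by Student Number and Detail Date, whose values are the count of passed exams for each combination of Student Number and Detail Date. reset_index() turns it into a data frame so that it can be merged.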
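To check the approach against the sample data in the question, here is a self-contained sketch. It rebuilds the two frames by hand, uses the column names from the question's code ('Enrollment Date', 'Exam Grade Date', 'Exam Grade'), which is an assumption since the sample tables use shorter headers, and keeps the merged rows in a separate `merged` variable instead of overwriting `exams`.

```python
import numpy as np
import pandas as pd

# Sample data from the question, with dates parsed as month/day/year.
students = pd.DataFrame({
    "Student Number": [1, 1, 2, 3],
    "Enrollment Date": ["1/1/2016", "1/1/2016", "2/1/2016", "3/1/2016"],
    "Detail Date": ["2/1/2016", "3/1/2016", "3/1/2016", "4/1/2016"],
})
exams = pd.DataFrame({
    "Student Number": [1, 1, 1, 1, 1, 1, 2, 2, 2],
    "Exam Grade Date": ["1/1/2016", "1/15/2016", "1/28/2016", "2/5/2016",
                        "3/5/2016", "4/5/2016", "2/2/2016", "2/3/2016",
                        "2/4/2016"],
    "Exam Grade": [50, 80, 90, 100, 80, 40, 85, 10, 100],
})
for col in ["Enrollment Date", "Detail Date"]:
    students[col] = pd.to_datetime(students[col], format="%m/%d/%Y")
exams["Exam Grade Date"] = pd.to_datetime(
    exams["Exam Grade Date"], format="%m/%d/%Y"
)

# Pair every exam with every student row that shares its Student Number.
merged = exams.merge(students, on="Student Number", how="left")

# Flag exams inside the date window with a passing grade.
merged["Passed"] = np.where(
    (merged["Exam Grade Date"] >= merged["Enrollment Date"])
    & (merged["Exam Grade Date"] <= merged["Detail Date"])
    & (merged["Exam Grade"] >= 70),
    1, 0)

# Count the flags per student row and attach the counts to students.
counts = (merged.groupby(["Student Number", "Detail Date"])["Passed"]
          .sum().reset_index())
result = students.merge(counts, on=["Student Number", "Detail Date"],
                        how="left")

# Students with no exams at all have no match, so fill those with 0.
result["Passed"] = result["Passed"].fillna(0).astype("int")

print(result)
```

On this sample the counts come out as 2, 3, 2 and 0, matching the expected result in the question. Note that grouping by Student Number and Detail Date assumes that pair identifies a student row uniquely; if it might not, adding 'Enrollment Date' to both the groupby and the merge keys would be safer. The merge also creates one row per exam per matching student row, so memory use grows with the number of duplicate student numbers.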