I have a data pipeline written in Python with Pandas. It works, and I mostly like it. I'm fairly comfortable with Python; less so with Pandas at this point. As a student of Python, though, I keep hoping to be more idiomatic rather than relying on the loops and logic I might use in other languages. Specifically, I'm always surprised that Pandas methods can iterate over Series and DataFrames without me having to build loops around them.
Here's the source file. I also have additional demographic information that comes from an API call. That code has been truncated in my example, but assume data['record'] holds my demographic data and that I can merge it in successfully with pd.merge().
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|StudentFirstName|StudentMiddleName|StudentLastName|UniqueIdentifier|Grade|GOM |SchoolYear|Fall_September|Fall_SeptemberDateGiven|Fall_SeptemberNationalPercentileRank|Winter_January|Winter_JanuaryDateGiven|Winter_JanuaryNationalPercentileRank|Spring_May|Spring_MayDateGiven|Spring_MayNationalPercentileRank|
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| | | |100 |1 |LNF |2017 |29 |9/11/2017 |14 | | | | | | |
| | | |100 |1 |LSF |2017 |31 |9/11/2017 |51 | | | | | | |
| | | |100 |1 |M-COMP|2017 |8 |9/20/2017 |48 |15 |2/5/2018 |17 |42 |5/8/2018 |65 |
| | | |100 |1 |MNM |2017 |9 |9/11/2017 |36 |7 |2/1/2018 |3 |5 |5/8/2018 |1 |
| | | |100 |1 |NIM |2017 |35 |9/11/2017 |34 |62 |2/1/2018 |52 |51 |5/8/2018 |18 |
| | | |100 |1 |NWF |2017 |28 |9/11/2017 |37 |69 |2/1/2018 |71 |31 |5/8/2018 |5 |
| | | |100 |1 |OCM |2017 |62 |9/11/2017 |30 |89 |2/1/2018 |58 |94 |5/8/2018 |51 |
| | | |100 |1 |PSF |2017 |14 |9/11/2017 |10 | | | |49 |5/8/2018 |33 |
| | | |100 |1 |QDM |2017 |23 |9/11/2017 |57 |31 |2/1/2018 |46 |26 |5/8/2018 |15 |
| | | |100 |1 |R-CBM |2017 |8 |9/11/2017 |36 |17 |2/1/2018 |22 |29 |5/8/2018 |15 |
| | | |200 |1 |LNF |2017 |47 |9/11/2017 |51 | | | | | | |
| | | |200 |1 |LSF |2017 |47 |9/11/2017 |86 | | | | | | |
| | | |200 |1 |M-COMP|2017 |27 |9/22/2017 |92 |34 |2/2/2018 |67 |47 |5/8/2018 |88 |
| | | |200 |1 |MNM |2017 |11 |9/11/2017 |48 |23 |2/1/2018 |80 |21 |5/9/2018 |55 |
| | | |200 |1 |NIM |2017 |56 |9/11/2017 |81 |80 |2/1/2018 |95 |80 |5/9/2018 |92 |
| | | |200 |1 |NWF |2017 |63 |9/11/2017 |87 | | | | | | |
| | | |200 |1 |OCM |2017 |107 |9/11/2017 | |109 |2/1/2018 | |109 |5/9/2018 | |
| | | |200 |1 |PSF |2017 |50 |9/11/2017 |73 | | | | | | |
| | | |200 |1 |QDM |2017 |28 |9/11/2017 |75 |38 |2/1/2018 |78 |40 |5/9/2018 |84 |
| | | |200 |1 |R-CBM |2017 |40 |9/11/2017 |80 |76 |2/1/2018 |80 |84 |5/9/2018 |65 |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Because of what happens downstream, I need to split this into two files: one for reading data and one for math data. Which file a row goes into is determined by the GOM column; roughly half the GOM values are reading and the other half are math.
Here's what the two output files look like. Reading:
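For what it's worth, I assume the split itself can be sketched with Series.isin (toy data here, not my real frame):

```python
import pandas as pd

# Toy data; 'reading_goms' stands in for my real Reading GOM list.
reading_goms = ['LNF', 'LSF', 'PSF', 'NWF', 'R-CBM', 'MAZE']
df = pd.DataFrame({'GOM': ['LNF', 'M-COMP', 'R-CBM'], 'score': [29, 8, 8]})

# Rows whose GOM is in the reading list go to one frame, everything else to the other.
mask = df['GOM'].isin(reading_goms)
reading, math = df[mask], df[~mask]
```

That avoids any explicit loop over rows.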
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|student_number|last_name|first_name|Grade|Season|Date |LNF 1|LNF 2|LNF 3 |LSF 1|LSF 2|LSF 3 |PSF 1|PSF 2|PSF 3 |NWF 1|NWF 2|NWF 3 |R-CBM 1|R-CBM 2|R-CBM 3|MAZE 1|MAZE 2|MAZE 3|
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|100 | | |1 |Fall |9/11/2017 |29 |14 |Level 2 |31 |51 |Level 3 |14 |10 |Level 1 |28 |37 |Level 3|8 |36 |Level 3| | | |
|100 | | |1 |Spring|5/8/2018 | | | | | | |49 |33 |Level 3 |31 |5 |Level 1|29 |15 |Level 2| | | |
|100 | | |1 |Winter|2/1/2018 | | | | | | | | | |69 |71 |Level 3|17 |22 |Level 2| | | |
|200 | | |1 |Fall |9/11/2017 |47 |51 |Level 3 |47 |86 |Level 4 |50 |73 |Level 3 |63 |87 |Level 4|40 |80 |Level 4| | | |
|200 | | |1 |Spring|5/9/2018 | | | | | | | | | | | | |84 |65 |Level 3| | | |
|200 | | |1 |Winter|2/1/2018 | | | | | | | | | | | | |76 |80 |Level 4| | | |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
And math:
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|student_number|last_name|first_name|Grade|Season|Date |OCM 1|OCM 2|OCM 3 |NIM 1|NIM 2|NIM 3 |QDM 1|QDM 2|QDM 3 |MNM 1|MNM 2|MNM 3 |M-COMP 1|M-COMP 2|M-COMP 3|M-CAP 1|M-CAP 2|M-CAP 3|
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|100 | | |1 |Fall |9/11/2017|62 |30 |Level 3|35 |34 |Level 3|23 |57 |Level 3|9 |36 |Level 3|8 |48 |Level 3 | | | |
|100 | | |1 |Spring|5/8/2018 |94 |51 |Level 3|51 |18 |Level 2|26 |15 |Level 2|5 |1 |Level 1|42 |65 |Level 3 | | | |
|100 | | |1 |Winter|2/1/2018 |89 |58 |Level 3|62 |52 |Level 3|31 |46 |Level 3|7 |3 |Level 1|15 |17 |Level 2 | | | |
|200 | | |1 |Fall |9/11/2017|107 | | |56 |81 |Level 4|28 |75 |Level 3|11 |48 |Level 3|27 |92 |Level 5 | | | |
|200 | | |1 |Spring|5/8/2018 |109 | | |80 |92 |Level 5|40 |84 |Level 4|21 |55 |Level 3|47 |88 |Level 4 | | | |
|200 | | |1 |Winter|2/1/2018 |109 | | |80 |95 |Level 5|38 |78 |Level 4|23 |80 |Level 4|34 |67 |Level 3 | | | |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
The trickiest part is finding the right date for each subject. Kids take all the GOMs at roughly the same time, but the file that comes out has a different date for each GOM. When the data goes into our other system, each subject gets a single date covering multiple GOMs. I resorted to writing a function that iterates through them, but there's probably an easier way.
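I suspect that minimum date could be taken column-wise instead of with my row-by-row function; a minimal sketch with made-up date columns:

```python
import pandas as pd

# Made-up date columns; '' plays the role of a missing date.
df = pd.DataFrame({'LNF Date': ['9/11/2017', ''],
                   'LSF Date': ['9/12/2017', '9/13/2017']})

date_cols = ['LNF Date', 'LSF Date']
# errors='coerce' turns blanks into NaT, which min(axis=1) skips by default,
# so no per-row filtering function is needed.
earliest = df[date_cols].apply(pd.to_datetime, errors='coerce').min(axis=1)
df['Date'] = earliest.dt.strftime('%m/%d/%Y')
```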
Normally I'd run this for only one season at a time and comment out the others, but right now I'm importing last year's data and doing all three seasons at once.
A quick note on the output columns: for each GOM, 1 is the numeric score, 2 is the percentile rank, and 3 is the letter description of the performance level. It's a terrible way to organize a file, but it's exactly what the vendor requires.
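I also suspect the percentile-to-level mapping in my code below could be written with pd.cut instead of a chain of ifs; here's a sketch using the same cutoffs:

```python
import pandas as pd

percentiles = pd.Series([14, 51, 92, 10, 76])
# Same integer cutoffs as my percentile2level function: 0-10 -> Level 1,
# 11-25 -> Level 2, 26-75 -> Level 3, 76-90 -> Level 4, 91-100 -> Level 5.
levels = pd.cut(percentiles, bins=[-1, 10, 25, 75, 90, 100],
                labels=['Level 1', 'Level 2', 'Level 3', 'Level 4', 'Level 5'])
```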
I'd like to accomplish this in fewer steps, or at least in a more idiomatic way. I find myself undoing things that pivot_table produces just so I can work with the DataFrame again in the ways I know, which are still limited.
My biggest concern is where I melt the pivot table (lines 65-77 of my script). I don't know whether that can be done more efficiently. Also, lines 108-111 are really dumb from a Python perspective. I just didn't want to list the GOMs a second time; I wanted to generate the list of LNF 1, LNF 2, LNF 3, LSF 1, ... quickly without retyping. There has to be a better way.
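For that second part, I imagine a nested comprehension could generate the headers without retyping anything (stand-in GOM list here):

```python
# Stand-in for one subject's GOM list.
goms = ['LNF', 'LSF', 'PSF']
# One header per (GOM, attribute-number) pair, in vendor order.
headings = ['{} {}'.format(gom, n) for gom in goms for n in (1, 2, 3)]
```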
Please let me know if you have any suggestions for improving this process, whether in pure Python or (especially) in Pandas.
import pandas as pd
import os
import json
import glob
import re
import datetime
import numpy as np
import itertools

# build of list of seasons to process. usually only one.
seasons = []
seasons.append('Fall')
seasons.append('Winter')
seasons.append('Spring')

# Read demographic data into dataframe
# data['record'] comes from something that was truncated for simplicity
demographics = pd.DataFrame(data['record'], columns=['student_number', 'state_studentnumber', 'last_name', 'first_name', 'dob'])

# iterate through score files
for file in glob.glob('To Do/*.csv'):
    # read score data into dataframe
    aimsweb = pd.read_csv(file)
    aimsweb["Fall_SeptemberDateGiven"] = pd.to_datetime(aimsweb['Fall_SeptemberDateGiven']).dt.strftime('%m/%d/%Y');
    aimsweb["Winter_JanuaryDateGiven"] = pd.to_datetime(aimsweb['Winter_JanuaryDateGiven']).dt.strftime('%m/%d/%Y');
    aimsweb["Spring_MayDateGiven"] = pd.to_datetime(aimsweb['Spring_MayDateGiven']).dt.strftime('%m/%d/%Y');

    # perform left outer join of score data to demographics. this ensures these are real students.
    merged = aimsweb.merge(demographics, how='left', left_on=['UniqueIdentifier', 'StudentLastName', 'StudentFirstName'], right_on=['student_number', 'last_name', 'first_name'], indicator=True)

    renames = {}
    # rename table
    renames['Fall_September'] = 'Fall_1'
    renames['Fall_SeptemberDateGiven'] = 'Fall_Date'
    renames['Fall_SeptemberNationalPercentileRank'] = 'Fall_2'
    renames['Winter_January'] = 'Winter_1'
    renames['Winter_JanuaryDateGiven'] = 'Winter_Date'
    renames['Winter_JanuaryNationalPercentileRank'] = 'Winter_2'
    renames['Spring_May'] = 'Spring_1'
    renames['Spring_MayDateGiven'] = 'Spring_Date'
    renames['Spring_MayNationalPercentileRank'] = 'Spring_2'
    merged.rename(index=str, columns=renames, inplace=True)

    # function for converting percentile into level 1-5
    def percentile2level(percentile):
        if percentile >= 91:
            return 'Level 5'
        if percentile >= 76:
            return 'Level 4'
        if percentile >= 26:
            return 'Level 3'
        if percentile >= 11:
            return 'Level 2'
        if percentile >= 0:
            return 'Level 1'

    # convert percentile to level
    for season in ['Fall', 'Winter', 'Spring']:
        merged[season + '_3'] = merged[season + '_2'].apply(percentile2level);

    # melt data from columns into rows
    merged = merged.melt(id_vars=['student_number', 'GOM', 'last_name', 'first_name', 'Grade'], value_vars=['Fall_Date', 'Fall_1', 'Fall_2', 'Fall_3', 'Winter_Date', 'Winter_1', 'Winter_2', 'Winter_3', 'Spring_Date', 'Spring_1', 'Spring_2', 'Spring_3'])
    # split the variable into season and attribute
    merged['Season'], merged['Attribute'] = merged['variable'].str.split('_', 1).str
    # pivot data back to columns
    merged = pd.pivot_table(merged, index=['student_number', 'Season', 'last_name', 'first_name', 'Grade'], values=['value'], columns=['GOM', 'Attribute'], aggfunc=np.max, fill_value='')
    # condense the levels and reset index to get back to flat df
    merged.columns = merged.columns.droplevel(0)
    merged.columns = [' '.join(col).strip() for col in merged.columns.values]
    merged.reset_index(inplace=True);

    # build subject-GOM mappings
    subjects = {};
    subjects['Reading'] = ['LNF', 'LSF', 'PSF', 'NWF', 'R-CBM', 'MAZE']
    subjects['Mathematics'] = ['OCM', 'NIM', 'QDM', 'MNM', 'M-COMP', 'M-CAP']

    # function for finding the min non-null date from a list
    def mindate(row, headings):
        dates = list(row[headings])
        dates = list(filter(None, dates))
        if len(dates) > 0:
            return min(dates)
        return "";

    # iterate through two subjects
    for subject in subjects:
        # build list of date headers for all GOMs
        headings = [];
        for gom in subjects[subject]:
            headings.append(gom + " Date");

        # create df for this subject
        df = merged
        # create date column with value of minimum from list of dates
        df['Date'] = df.apply(mindate, axis=1, args=(headings,))
        # filter out records with no real date
        df = df[df['Date'] != 'NaT']

        # build a list of GOM headers with 1, 2, 3
        headings = [];
        for combo in list(itertools.product(subjects[subject], [1, 2, 3])):
            headings.append(combo[0] + " " + str(combo[1]))

        # save csv
        df[df['Season'].isin(seasons)][['student_number', 'last_name', 'first_name', 'Grade', 'Season', 'Date'] + headings].to_csv("./" + os.path.splitext(os.path.basename(file))[0] + "_" + "".join(seasons) + "_" + subject + ".csv", index=False);