Improving mapped lambdas in Python (pandas)

Time: 2015-07-26 13:31:13

Tags: python csv pandas lambda

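I am digesting several csv files (each holding one or more years of data) to classify medical treatments into broad categories, while keeping only a subset of the original information and aggregating up to a monthly count (by AR = year, and month) of treatments per person (LopNr). Many treatments belong to several categories at once: multiple diagnosis codes are listed in the relevant csv column, so I split that field into a column of lists and classify a row by whether any of its diagnosis codes falls into the relevant range of ICD-9 codes.

I am using IOPro to save memory, but I still run into a segfault (still investigating). The text files are several GB each, but the machine has 256 GB of RAM. Either one of the packages is buggy, or I need a more memory-efficient solution.

I am using pandas 0.16.2 np19py26_0, iopro 1.7.1 np19py27_p0 and python 2.7.10 0 under Linux.

So the original data looks like this: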

LopNr   AR INDATUMA DIAGNOS …
1     2007 20070812 C32 F17
1     2007 20070816     C36

I would like to see an aggregate like this:

LopNr   AR month tobacco …
1     2007     8       2

By the way, I need Stata dta files in the end, but I go through csv because pandas.DataFrame.to_stata seemed flaky in my experience. Maybe I am missing something there too.

# -*- coding: utf-8 -*-
import iopro
import numpy as np
from pandas import *

all_treatments  = DataFrame()
filelist = ['oppenvard20012005','oppenvard20062010','oppenvard2011','oppenvard2012','slutenvard1997','slutenvard2011','slutenvard2012','slutenvard19982004','slutenvard20052010']

tobacco = lambda lst: any( (((x >= 'C30') and (x<'C40')) or ((x >= 'F17') and (x<'F18')))  for x in lst)
nutrition = lambda lst: any( (((x >= 'D50') and (x<'D54')) or ((x >= 'E10') and (x<'E15')) or ((x >= 'E40') and (x<'E47')) or ((x >= 'E50') and (x<'E69')))  for x in lst)
mental = lambda lst: any( (((x >= 'F') and (x<'G')) )  for x in lst)
alcohol = lambda lst: any( (((x >= 'F10') and (x<'F11')) or ((x >= 'K70') and (x<'K71')))  for x in lst)
circulatory = lambda lst: any( (((x >= 'I') and (x<'J')) )  for x in lst)
dental = lambda lst: any( (((x >= 'K02') and (x<'K04')) )  for x in lst)
accident = lambda lst: any( (((x >= 'V01') and (x<'X60')) )  for x in lst)
selfharm = lambda lst: any( (((x >= 'X60') and (x<'X85')) )  for x in lst)
cancer = lambda lst: any( (((x >= 'C') and (x<'D')) )  for x in lst)
endonutrimetab = lambda lst: any( (((x >= 'E') and (x<'F')) )  for x in lst)
pregnancy = lambda lst: any( (((x >= 'O') and (x<'P')) )  for x in lst)
other_stress = lambda lst: any( (((x >= 'J00') and (x<'J48')) or ((x >= 'L20') and (x<'L66')) or ((x >= 'K20') and (x<'K60')) or ((x >= 'R') and (x<'S')) or ((x >= 'X86') and (x<'Z77')))  for x in lst)

for file in filelist:
    filename = 'PATH' + file +'.txt'
    adapter = iopro.text_adapter(filename,parser='csv',field_names=True,output='dataframe',delimiter='\t')
    treatments = adapter[['LopNr','AR','DIAGNOS','INDATUMA']][:]
    treatments['month'] = treatments['INDATUMA'] % 10000
    treatments['day'] = treatments['INDATUMA'] % 100
    treatments['month'] = (treatments['month']-treatments['day'])/100  
    del treatments['day']
    diagnoses = treatments['DIAGNOS'].str.split(' ')
    del treatments['DIAGNOS']
    treatments['tobacco'] = diagnoses.map(tobacco)
    treatments['nutrition'] = diagnoses.map(nutrition)
    treatments['mental'] = diagnoses.map(mental)
    treatments['alcohol'] = diagnoses.map(alcohol)
    treatments['circulatory'] = diagnoses.map(circulatory)
    treatments['dental'] = diagnoses.map(dental)
    treatments['accident'] = diagnoses.map(accident)
    treatments['selfharm'] = diagnoses.map(selfharm)
    treatments['cancer'] = diagnoses.map(cancer)
    treatments['endonutrimetab'] = diagnoses.map(endonutrimetab)
    treatments['pregnancy'] = diagnoses.map(pregnancy)
    treatments['other_stress'] = diagnoses.map(other_stress)
    all_treatments = all_treatments.append(treatments)
all_treatments = all_treatments.groupby(['LopNr','AR','month']).aggregate(np.count_nonzero) #.sum()
all_treatments = all_treatments.astype(int,copy=False,raise_on_error=False)
all_treatments.to_csv('PATH.csv')

2 Answers:

Answer 0 (score: 1)

Some comments:

  1. As already mentioned, you should use def instead of lambda expressions to improve readability. Example:

        def tobacco(codes):
            return any('C30' <= x < 'C40' or 'F17' <= x < 'F18' for x in codes)

     You can also vectorize these functions as follows:

        def tobacco(codes_column):
            return [any('C30' <= code < 'C40' or 'F17' <= code < 'F18' for code in codes)
                    if codes else False
                    for codes in codes_column]

        diagnoses = all_treatments['DIAGNOS'].str.split(' ').tolist()
        all_treatments['tobacco'] = tobacco(diagnoses)

  2. You initialize all_treatments as a DataFrame and then append to it. This is very inefficient. Try all_treatments = list() instead, and then add all_treatments = pd.concat(all_treatments, ignore_index=True) outside the loop, just before the groupby. Inside the loop it should then be all_treatments.append(treatments) (rather than all_treatments = all_treatments.append(treatments)).

  3. To calculate the month used for grouping, you can use:

        all_treatments['month'] = all_treatments.INDATUMA % 10000 // 100

  4. Finally, instead of applying the lambda functions to each file after reading it, try applying them once to the combined all_treatments DataFrame.

  P.S. You may also want to try .sum() on your groupby statement instead of .aggregate(np.count_nonzero).
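Putting the suggestions above together, a minimal sketch might look like the following. The two small in-memory frames are hypothetical stand-ins for the files read in the loop, the name chunks is made up for illustration, and the isinstance check is a small addition so that rows with a missing DIAGNOS count as "no match" instead of raising.

```python
import pandas as pd


def tobacco(codes_column):
    """Flag each row whose code list has a code in [C30, C40) or [F17, F18)."""
    return [
        any('C30' <= code < 'C40' or 'F17' <= code < 'F18' for code in codes)
        if isinstance(codes, list) else False
        for codes in codes_column
    ]


# Hypothetical stand-ins for the per-file frames produced inside the loop.
frames_from_files = [
    pd.DataFrame({'LopNr': [1, 1], 'AR': [2007, 2007],
                  'INDATUMA': [20070812, 20070816],
                  'DIAGNOS': ['C32 F17', 'C36']}),
    pd.DataFrame({'LopNr': [2], 'AR': [2008],
                  'INDATUMA': [20080105], 'DIAGNOS': ['K02']}),
]

# Collect the pieces in a plain list and concatenate once at the end.
chunks = []
for treatments in frames_from_files:
    chunks.append(treatments)  # list.append mutates in place, no reassignment
all_treatments = pd.concat(chunks, ignore_index=True)

# YYYYMMDD -> MM using integer arithmetic only.
all_treatments['month'] = all_treatments['INDATUMA'] % 10000 // 100

# Classify once, on the combined frame.
diagnoses = all_treatments['DIAGNOS'].str.split(' ').tolist()
all_treatments['tobacco'] = tobacco(diagnoses)

# Summing booleans counts the True rows per person, year and month.
monthly = (all_treatments
           .groupby(['LopNr', 'AR', 'month'])[['tobacco']]
           .sum()
           .astype(int))
print(monthly)
```

For the two sample rows from the question (LopNr 1, August 2007), the tobacco count comes out as 2, which matches the aggregate the question asks for.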

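Answer 1 (score: 1)

I think you need to find a way to vectorize your solution. Using map with lambda functions is not very efficient and does not take advantage of the speedups that make pandas so useful. It is hard to say for sure since you have not posted sample data, but I think this is a good place to start: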

diagnoses = treatments['DIAGNOS'].str.split(expand=True)

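The result will be a DataFrame with one column per word (that is, per element of the split result). You can then run vectorized comparisons over the whole DataFrame. It might look something like this: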

between_c_vals = (diagnoses >= 'C30') & (diagnoses < 'C40')
between_f_vals = (diagnoses >= 'F17') & (diagnoses < 'F18')
treatments['tobacco'] = (between_c_vals | between_f_vals).any(axis=1)
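For a quick check of this approach, here is a small self-contained sketch on made-up data. It assumes half-open ranges as in the question's lambdas (so 'C40' and 'F18' themselves are excluded), and it fills the padding cells with empty strings, which is an added precaution rather than part of the original snippet.

```python
import pandas as pd

# Made-up sample rows; only the DIAGNOS column matters here.
treatments = pd.DataFrame({'DIAGNOS': ['C32 F17', 'C36', 'K02 C40', 'F18']})

# One column per code. Rows with fewer codes are padded with missing
# values, replaced here by '' so that every cell is a string.
diagnoses = treatments['DIAGNOS'].str.split(expand=True).fillna('')

# Element-wise string comparisons over the whole frame.
in_c_range = (diagnoses >= 'C30') & (diagnoses < 'C40')
in_f_range = (diagnoses >= 'F17') & (diagnoses < 'F18')

# A row is flagged if any of its codes falls in either range.
treatments['tobacco'] = (in_c_range | in_f_range).any(axis=1)
print(treatments['tobacco'].tolist())  # -> [True, True, False, False]
```

The empty string sorts before every code, so the padding cells can never satisfy a lower bound and do not produce spurious matches.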

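This should be several hundred times faster than .map, which loops in Python. Note that the bitwise operators & and | can be used to perform set logic on boolean vectors and matrices (or DataFrames). If you show a sample of treatments['DIAGNOS'], I can help more. One thing to watch out for in the comparisons is NaN values, since comparing NaN with anything always returns False. I think that is fine here, though, because it cannot produce any unwanted True values.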
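The same idea extends to several categories by keeping the ranges in a table and combining the boolean masks with |. The sketch below is only one way to arrange it: CATEGORIES and in_ranges are hypothetical names, the ranges are copied from three of the question's lambdas, and the sample rows (including one with a missing DIAGNOS) are made up.

```python
from functools import reduce
import operator

import pandas as pd

# Half-open [low, high) ranges, taken from three of the question's lambdas.
CATEGORIES = {
    'tobacco': [('C30', 'C40'), ('F17', 'F18')],
    'alcohol': [('F10', 'F11'), ('K70', 'K71')],
    'mental': [('F', 'G')],
}


def in_ranges(codes, ranges):
    """Return a row-wise flag: True if any cell lies in any [low, high) range."""
    masks = [(codes >= low) & (codes < high) for low, high in ranges]
    return reduce(operator.or_, masks).any(axis=1)


# Made-up sample, including a row without any diagnosis.
treatments = pd.DataFrame({'DIAGNOS': ['C32 F17', 'F10', 'K70 I10', None]})

# Missing cells become '' so they compare as strings and never match.
codes = treatments['DIAGNOS'].str.split(expand=True).fillna('')

for name, ranges in CATEGORIES.items():
    treatments[name] = in_ranges(codes, ranges)

print(treatments)
```

The row without a diagnosis ends up False in every category, which is the "no unwanted True values" behaviour described above.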