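I am digesting several csv files (each holding one or more years of data) to classify medical treatments into broad categories, while keeping only part of the original information and aggregating it to monthly counts of treatments per person (LopNr) and year (AR). Many treatments belong to several categories at once: multiple diagnosis codes are listed in the relevant csv column, so I split that field into a column of lists and classify each row by whether any of its diagnosis codes falls within the relevant ICD-9 code range.
I am using IOPro to save memory, but I still run into a segfault (still investigating). The text files are a few GB each, but the machine has 256 GB of RAM. Either one of the packages is buggy, or I need a more memory-efficient solution.
I am on Linux with pandas 0.16.2 np19py26_0, iopro 1.7.1 np19py27_p0 and python 2.7.10 0.
So the raw data looks like this: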
LopNr AR INDATUMA DIAGNOS …
1 2007 20070812 C32 F17
1 2007 20070816 C36
And I would like to see an aggregate like this:
LopNr AR month tobacco …
1 2007 8 2
By the way, I need Stata dta files in the end, but I go through csv because pandas.DataFrame.to_stata seemed flaky in my experience. Maybe I am missing something there as well.
# -*- coding: utf-8 -*-
import iopro
import numpy as np
from pandas import *
all_treatments = DataFrame()
filelist = ['oppenvard20012005','oppenvard20062010','oppenvard2011','oppenvard2012','slutenvard1997','slutenvard2011','slutenvard2012','slutenvard19982004','slutenvard20052010']
tobacco = lambda lst: any( (((x >= 'C30') and (x<'C40')) or ((x >= 'F17') and (x<'F18'))) for x in lst)
nutrition = lambda lst: any( (((x >= 'D50') and (x<'D54')) or ((x >= 'E10') and (x<'E15')) or ((x >= 'E40') and (x<'E47')) or ((x >= 'E50') and (x<'E69'))) for x in lst)
mental = lambda lst: any( (((x >= 'F') and (x<'G')) ) for x in lst)
alcohol = lambda lst: any( (((x >= 'F10') and (x<'F11')) or ((x >= 'K70') and (x<'K71'))) for x in lst)
circulatory = lambda lst: any( (((x >= 'I') and (x<'J')) ) for x in lst)
dental = lambda lst: any( (((x >= 'K02') and (x<'K04')) ) for x in lst)
accident = lambda lst: any( (((x >= 'V01') and (x<'X60')) ) for x in lst)
selfharm = lambda lst: any( (((x >= 'X60') and (x<'X85')) ) for x in lst)
cancer = lambda lst: any( (((x >= 'C') and (x<'D')) ) for x in lst)
endonutrimetab = lambda lst: any( (((x >= 'E') and (x<'F')) ) for x in lst)
pregnancy = lambda lst: any( (((x >= 'O') and (x<'P')) ) for x in lst)
other_stress = lambda lst: any( (((x >= 'J00') and (x<'J48')) or ((x >= 'L20') and (x<'L66')) or ((x >= 'K20') and (x<'K60')) or ((x >= 'R') and (x<'S')) or ((x >= 'X86') and (x<'Z77'))) for x in lst)
for file in filelist:
    filename = 'PATH' + file +'.txt'
    adapter = iopro.text_adapter(filename,parser='csv',field_names=True,output='dataframe',delimiter='\t')
    treatments = adapter[['LopNr','AR','DIAGNOS','INDATUMA']][:]
    treatments['month'] = treatments['INDATUMA'] % 10000
    treatments['day'] = treatments['INDATUMA'] % 100
    treatments['month'] = (treatments['month']-treatments['day'])/100
    del treatments['day']
    diagnoses = treatments['DIAGNOS'].str.split(' ')
    del treatments['DIAGNOS']
    treatments['tobacco'] = diagnoses.map(tobacco)
    treatments['nutrition'] = diagnoses.map(nutrition)
    treatments['mental'] = diagnoses.map(mental)
    treatments['alcohol'] = diagnoses.map(alcohol)
    treatments['circulatory'] = diagnoses.map(circulatory)
    treatments['dental'] = diagnoses.map(dental)
    treatments['accident'] = diagnoses.map(accident)
    treatments['selfharm'] = diagnoses.map(selfharm)
    treatments['cancer'] = diagnoses.map(cancer)
    treatments['endonutrimetab'] = diagnoses.map(endonutrimetab)
    treatments['pregnancy'] = diagnoses.map(pregnancy)
    treatments['other_stress'] = diagnoses.map(other_stress)
    all_treatments = all_treatments.append(treatments)
all_treatments = all_treatments.groupby(['LopNr','AR','month']).aggregate(np.count_nonzero) #.sum()
all_treatments = all_treatments.astype(int,copy=False,raise_on_error=False)
all_treatments.to_csv('PATH.csv')
Answer 0 (score: 1)
Some comments:
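Use def instead of lambda expressions; it makes these functions easier to read. Example:
def tobacco(codes):
    return any('C30' <= x < 'C40' or
               'F17' <= x < 'F18' for x in codes)
You can also rewrite these functions to work on a whole column at once:
def tobacco(codes_column):
    # A row with a missing DIAGNOS arrives as NaN rather than a list, so treat it as no match.
    return [any('C30' <= code < 'C40' or
                'F17' <= code < 'F18'
                for code in codes) if isinstance(codes, list) else False
            for codes in codes_column]
diagnoses = all_treatments['DIAGNOS'].str.split(' ').tolist()
all_treatments['tobacco'] = tobacco(diagnoses)
You initialize all_treatments as a DataFrame and then append to it inside the loop. This is very inefficient, because every append copies all the data accumulated so far. Try all_treatments = list() instead, and add all_treatments = pd.concat(all_treatments, ignore_index=True) outside the loop, just before the groupby (this assumes import pandas as pd; with from pandas import * the call is simply concat). Inside the loop, the call should then be all_treatments.append(treatments) (rather than all_treatments = all_treatments.append(treatments)), because list.append works in place and returns None.
To compute the month used for grouping, you can use:
all_treatments['month'] = all_treatments.INDATUMA % 10000 // 100
Finally, rather than applying the lambda functions to each file right after reading it, try applying them once to the combined all_treatments DataFrame.
P.S. You may also want to try .sum() on the groupby statement instead of .aggregate(np.count_nonzero).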
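The comments above can be put together roughly as follows. This is only a minimal sketch under several assumptions: pandas.read_csv stands in for iopro.text_adapter, SAMPLE_FILES is a hypothetical in-memory replacement for the real tab-separated files, and only the tobacco category is shown.

```python
import io

import pandas as pd

# Hypothetical stand-ins for the real tab-separated files (not the asker's data).
SAMPLE_FILES = [
    "LopNr\tAR\tINDATUMA\tDIAGNOS\n"
    "1\t2007\t20070812\tC32 F17\n"
    "1\t2007\t20070816\tC36\n",
    "LopNr\tAR\tINDATUMA\tDIAGNOS\n"
    "2\t2011\t20110103\tK02\n"
    "2\t2011\t20110120\t\n",  # empty DIAGNOS, read as NaN
]


def tobacco(codes_column):
    """Return one boolean per row: does any code fall in C30-C39 or F17?"""
    return [isinstance(codes, list) and
            any('C30' <= code < 'C40' or 'F17' <= code < 'F18' for code in codes)
            for codes in codes_column]


def load_all(sources):
    """Read every source into its own frame, then concatenate once."""
    frames = []
    for text in sources:
        frames.append(pd.read_csv(io.StringIO(text), sep='\t',
                                  usecols=['LopNr', 'AR', 'INDATUMA', 'DIAGNOS'],
                                  dtype={'DIAGNOS': str}))
    return pd.concat(frames, ignore_index=True)


def monthly_counts(all_treatments):
    """Classify the combined frame once and count treatments per person-month."""
    df = all_treatments.copy()
    # 20070812 % 10000 -> 812, and 812 // 100 -> 8
    df['month'] = df['INDATUMA'] % 10000 // 100
    diagnoses = df['DIAGNOS'].str.split(' ').tolist()
    df['tobacco'] = tobacco(diagnoses)
    flags = df[['LopNr', 'AR', 'month', 'tobacco']]
    return flags.groupby(['LopNr', 'AR', 'month']).sum().astype(int)


result = monthly_counts(load_all(SAMPLE_FILES))
print(result)
```

The other categories would follow the same pattern as tobacco. Collecting the per-file frames in a list and concatenating once avoids re-copying the accumulated data on every iteration.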
Answer 1 (score: 1)
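I think you need to find a way to vectorize your solution. Using map with lambda functions is not efficient and does not take advantage of the speed-ups that make pandas so useful. It is hard to say for sure because you have not posted sample data, but I think this is a good starting point:
diagnoses = treatments['DIAGNOS'].str.split(expand=True)
The result is a DataFrame with one column for each word (that is, each element of the split result). You can then run vectorized comparisons on the whole DataFrame. It might look something like this:
between_c_vals = (diagnoses >= 'C30') & (diagnoses < 'C40')
between_f_vals = (diagnoses >= 'F17') & (diagnoses < 'F18')
treatments['tobacco'] = (between_c_vals | between_f_vals).any(axis=1)
This should be far faster (potentially hundreds of times) than .map, which loops in Python. Note that the bitwise operators & and | can be used to perform element-wise logic on boolean vectors and matrices (or DataFrames).
If you show a sample of treatments['DIAGNOS'], I can help more. One thing to watch for in the comparisons is NaN values, because comparing NaN with anything always returns False. I think that should be fine here, though, since it will not produce any unwanted True values.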
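As a rough illustration of the approach above, including the missing-value case, here is a minimal sketch. The sample DIAGNOS values are made up, and the fillna('') call is an added assumption: it turns missing cells into empty strings so that they explicitly compare as False rather than relying on NaN comparison behaviour.

```python
import pandas as pd

# Hypothetical sample data; the fourth row has no diagnosis at all.
treatments = pd.DataFrame({'DIAGNOS': ['C32 F17', 'C36', 'K02', None, 'C40']})

# One column per code; cells without a code are filled with '' (assumption),
# which sorts before every real code and therefore never matches a range.
diagnoses = treatments['DIAGNOS'].str.split(expand=True).fillna('')

# Half-open ranges, matching the bounds used in the question's lambdas.
between_c_vals = (diagnoses >= 'C30') & (diagnoses < 'C40')
between_f_vals = (diagnoses >= 'F17') & (diagnoses < 'F18')

# A row is flagged if any of its codes falls into either range.
treatments['tobacco'] = (between_c_vals | between_f_vals).any(axis=1)
print(treatments)
```

The upper bounds are exclusive here, so a code of exactly C40 is not counted as tobacco-related, in line with the ranges in the question.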