Pandas query is very slow

Time: 2019-05-03 07:22:35

Tags: python pandas

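I have the following code, which reads a csv file and then analyzes it. One patient can have more than one illness, and I need to find how many times each illness is seen across all patients. But the query given here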

raw_data[(raw_data['Finding Labels'].str.contains(ctr)) & (raw_data['Patient ID'] == i)].size

is so slow that it takes more than 15 minutes. Is there a way to make the query faster?

raw_data = pd.read_csv(r'C:\Users\omer.kurular\Desktop\Data_Entry_2017.csv')

data = ["Cardiomegaly", "Emphysema", "Effusion", "No Finding", "Hernia", "Infiltration", "Mass", "Nodule", "Atelectasis", "Pneumothorax", "Pleural_Thickening", "Pneumonia", "Fibrosis", "Edema", "Consolidation"]

illnesses = pd.DataFrame({"Finding_Label":[], 
                     "Count_of_Patientes_Having":[],
                         "Count_of_Times_Being_Shown_In_An_Image":[]}) 

ids = raw_data["Patient ID"].drop_duplicates()

index = 0

for ctr in data[:1]:
    illnesses.at[index, "Finding_Label"] = ctr
    illnesses.at[index, "Count_of_Times_Being_Shown_In_An_Image"] = raw_data[raw_data["Finding Labels"].str.contains(ctr)].size / 12
    for i in ids:
        illnesses.at[index, "Count_of_Patientes_Having"] = raw_data[(raw_data['Finding Labels'].str.contains(ctr)) & (raw_data['Patient ID'] == i)].size
    index = index + 1

Part of the data frame:

raw_data

Finding Labels - Patient ID

IllnessA|IllnessB - 1

IllnessA - 2

2 Answers:

Answer 0: (score: 0)

As I understand it, ctr stands for the name of the illness.

When you run this query:

raw_data[(raw_data['Finding Labels'].str.contains(ctr)) & (raw_data['Patient ID'] == i)].size

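you are filtering not only for rows that have the illness but also for rows with a specific patient ID. If you have many patients, you end up running this query many times. A simpler approach is to skip the patient ID filter and count all rows that have the illness. That would be: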

raw_data[raw_data['Finding Labels'].str.contains(ctr)].size

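In this case, since what you need is the number of rows, you are looking for len rather than size (size is the number of cells in the data frame, that is, rows times columns).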
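To make the difference concrete, here is a minimal sketch on a made-up two-column frame (the sample rows are hypothetical, not taken from the asker's file). It would also explain the `/ 12` in the question, assuming the real csv has 12 columns.

```python
import pandas as pd

# Hypothetical two-column frame standing in for raw_data.
df = pd.DataFrame({
    "Finding Labels": ["Hernia", "Mass|Hernia", "No Finding"],
    "Patient ID": [1, 2, 3],
})

hernia_rows = df[df["Finding Labels"].str.contains("Hernia")]

n_cells = hernia_rows.size  # rows * columns: 2 * 2
n_rows = len(hernia_rows)   # rows only: 2
print(n_cells, n_rows)
```

With len the result no longer depends on how many columns the file happens to have.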

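Finally, another source of error in the current code is that you are not keeping a running count across patient IDs. You need to increment illnesses.at[index, "Count_of_Patientes_Having"] rather than set it to a new value each time.

Assuming you want to keep the illness name and the index separate, the code would look something like this (for the last few lines):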

for index, ctr in enumerate(data[:1]):
    illnesses.at[index, "Finding_Label"] = ctr
    illnesses.at[index, "Count_of_Times_Being_Shown_In_An_Image"] = len(raw_data[raw_data["Finding Labels"].str.contains(ctr)]) / 12
    illnesses.at[index, "Count_of_Patientes_Having"] = len(raw_data[raw_data['Finding Labels'].str.contains(ctr)])

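I took the liberty of using enumerate to handle the index in a more Pythonic way. I am also not sure what "Count_of_Times_Being_Shown_In_An_Image" is meant to be, but I assume you had the same confusion between size and len there.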
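Note that the loop above counts matching rows for both columns. If "Count_of_Patientes_Having" is meant to be the number of distinct patients (that is my reading of the question, not something the asker confirmed), one possible sketch uses a single boolean mask plus nunique. The helper name illness_counts and the sample rows are made up for illustration.

```python
import pandas as pd

# Hypothetical sample: patient 1 appears in two images.
raw_data = pd.DataFrame({
    "Finding Labels": ["Hernia", "Mass|Hernia", "Hernia", "No Finding"],
    "Patient ID": [1, 1, 2, 3],
})

def illness_counts(frame, illness):
    """Count images and distinct patients for one illness (illustrative helper)."""
    # regex=False treats the illness name as a literal substring.
    mask = frame["Finding Labels"].str.contains(illness, regex=False)
    return {
        "Finding_Label": illness,
        "Count_of_Times_Being_Shown_In_An_Image": int(mask.sum()),
        "Count_of_Patientes_Having": int(frame.loc[mask, "Patient ID"].nunique()),
    }

result = illness_counts(raw_data, "Hernia")
print(result)
```

The mask is computed once per illness, so there is no inner loop over patient IDs at all.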

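Answer 1: (score: 0)

Your code is likely slow because you are growing the data frame row by row inside a loop, which can involve multiple in-memory copies. In general, this is reminiscent of general-purpose Python rather than Pandas programming, which ideally processes data in vectorized blocks.

Consider cross joining your data (assuming a reasonable data size) with the list of illnesses, so that Finding Labels is aligned with every illness in the same row, and then keep only the rows where the longer string contains the shorter item. Then run a couple of groupby() calls to return the count and the distinct count of patients.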
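As a small illustration of the first point, and only a sketch on made-up rows, you can collect plain dicts in a list and build the data frame once at the end instead of assigning into it with .at on every iteration. The labels list and sample data here are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the real csv.
raw_data = pd.DataFrame({
    "Finding Labels": ["Hernia", "Mass|Hernia", "Hernia", "No Finding"],
    "Patient ID": [1, 1, 2, 3],
})
labels = ["Hernia", "Mass"]

rows = []
for label in labels:
    # One vectorized mask per label; no per-patient inner loop.
    mask = raw_data["Finding Labels"].str.contains(label, regex=False)
    rows.append({
        "Finding_Label": label,
        "Count_of_Times_Being_Shown_In_An_Image": int(mask.sum()),
    })

# The frame is allocated a single time here.
illnesses = pd.DataFrame(rows)
print(illnesses)
```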

# CROSS JOIN LIST WITH MAIN DATA FRAME (ALL ROWS MATCHED)
raw_data = (raw_data.assign(key=1)
                    .merge(pd.DataFrame({'ills':ills, 'key':1}), on='key')
                    .drop(columns=['key'])
            )

# SUBSET BY ILLNESS CONTAINED IN LONGER STRING
raw_data = raw_data[raw_data.apply(lambda x: x['ills'] in x['Finding Labels'], axis=1)]

# CALCULATE GROUP BY count AND distinct count
def count_distinct(grp):
    return (grp.groupby('Patient ID').size()).size

illnesses = pd.DataFrame({'Count_of_Times_Being_Shown_In_An_Image': raw_data.groupby('ills').size(),
                          'Count_of_Patients_Having': raw_data.groupby('ills').apply(count_distinct)})
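If Finding Labels is a pipe-delimited list, as the sample in the question suggests (an assumption about the real file), a different technique avoids the cross join and the row-wise apply: split the labels, explode them into one row per label, and aggregate. This is a sketch on made-up rows, not the method above, and it matches exact labels rather than substrings.

```python
import pandas as pd

# Hypothetical sample with pipe-delimited labels.
raw_data = pd.DataFrame({
    "Finding Labels": ["Hernia|Mass", "Hernia", "Mass", "Hernia"],
    "Patient ID": [1, 1, 2, 3],
})

# One row per (image, illness) pair.
long_form = (raw_data
             .assign(ills=raw_data["Finding Labels"].str.split("|"))
             .explode("ills"))

# Row count and distinct patient count per illness in one pass.
illnesses = long_form.groupby("ills").agg(
    Count_of_Times_Being_Shown_In_An_Image=("Patient ID", "size"),
    Count_of_Patients_Having=("Patient ID", "nunique"),
)
print(illnesses)
```

Because the intermediate frame only grows by the number of labels actually present, it stays much smaller than a full cross join against all 15 illness names.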

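To demonstrate, consider the input data below, generated with a random seed, and the resulting output.

Input data (attempting to mirror the original data)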

import numpy as np
import pandas as pd

alpha = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789'
data_tools = ['sas', 'stata', 'spss', 'python', 'r', 'julia']

ills = ["Cardiomegaly", "Emphysema", "Effusion", "No Finding", "Hernia", 
        "Infiltration", "Mass", "Nodule", "Atelectasis", "Pneumothorax", 
        "Pleural_Thickening", "Pneumonia", "Fibrosis", "Edema", "Consolidation"]

np.random.seed(542019)
raw_data = pd.DataFrame({'Patient ID': np.random.choice(data_tools, 25),
                         'Finding Labels': np.char.add(
                                              np.char.add(np.array([''.join(np.random.choice(list(alpha), 3)) for _ in range(25)]),
                                                          np.random.choice(ills, 25).astype('str')),
                                              np.array([''.join(np.random.choice(list(alpha), 3)) for _ in range(25)]))
                         })

print(raw_data.head(10))    
#   Patient ID       Finding Labels
# 0          r   xPNPneumothoraxXYm
# 1     python   ScSInfiltration9Ud
# 2      stata   tJhInfiltrationJtG
# 3          r      thLPneumoniaWdr
# 4      stata    thYAtelectasis6iW
# 5        sas      2WLPneumonia1if
# 6      julia  OPEConsolidationKq0
# 7        sas   UFFCardiomegaly7wZ
# 8      stata         9NQHerniaMl4
# 9     python         NB8HerniapWK

Output (after running the above process)

print(illnesses)
#                     Count_of_Times_Being_Shown_In_An_Image  Count_of_Patients_Having
# ills                                                                                
# Atelectasis                                              3                         1
# Cardiomegaly                                             2                         1
# Consolidation                                            1                         1
# Effusion                                                 1                         1
# Emphysema                                                1                         1
# Fibrosis                                                 2                         2
# Hernia                                                   4                         3
# Infiltration                                             2                         2
# Mass                                                     1                         1
# Nodule                                                   2                         2
# Pleural_Thickening                                       1                         1
# Pneumonia                                                3                         3
# Pneumothorax                                             2                         2