Filtering CSV data with Python / numpy

Time: 2014-01-09 04:25:18

Tags: python csv numpy lambda pandas

I am working with a CSV file.

       id   gender   disease           read  write  science
  1.   11   male     cancer, diabetes    34     46       39
  2.   20   male     diabetes            60     52       61
  3.   12   male     diabetes            37     44       39
  4.   16   male     cancer              47     31       36
  5.    7   male     diabetes            57     54       47
  6.   21   male     diabetes            44     44       50
  7.   15   male     diabetes            39     39       26
  8.   22   male     diabetes            42     39       56
  9.    9   male     cancer              48     49       44
 10.   18   male     diabetes            50     33       44
 11.    5   male     diabetes            47     40        .
 12.   14   male     diabetes            47     41       42
 13.    3   male     diabetes            63     65       63
 14.   24   male     fever               52     62       47
 15.    8   female   diabetes            39     44       44
 16.    1   female   cancer              34     44       39
 17.    4   female   diabetes            44     50       39
 18.    2   female   diabetes            39     41       42
 19.   19   female   cancer              28     46       44
 20.   17   female   diabetes            47     57       44
 21.    6   female   diabetes            47     41       40
 22.   10   female   diabetes            47     54       53
 23.   13   female   diabetes            47     46       47
 24.   23   female   diabetes            65     65       58
 25.   25   female   Breast cancer       47     44       42

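I want to get every row where the person has cancer. Some people have both diabetes and cancer, so those rows need to be picked up by the filter as well. The result should be: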

  1.   11   male     cancer, diabetes    34     46       39
  4.   16   male     cancer              47     31       36
  9.    9   male     cancer              48     49       44
 19.   19   female   cancer              28     46       44
 25.   25   female   Breast cancer       47     44       42


from os.path import dirname, join

import pandas as pd
import numpy as np

# read the CSV file that sits next to this script
ppl_ve_cancer = pd.read_csv(join(dirname(__file__), 'data.csv'))
delta = pd.DataFrame.from_records(ppl_ve_cancer)
disease = delta['disease']

Now, how do I filter the disease column and, after filtering, how do I get the data in the matching rows (id, gender, read, write, science)?

3 Answers:

Answer 0: (score: 2)

Here is a more pandas-centric way: first read all the data into a DataFrame, then create a has cancer column, then filter on it:

from io import StringIO
import pandas

datastring = StringIO("""\
id,gender,disease,read,write,science
11,male,"cancer,diabetes",34,46,39
20,male,diabetes,60,52,61
12,male,diabetes,37,44,39
16,male,cancer,47,31,36
7,male,diabetes,57,54,47
21,male,diabetes,44,44,50
15,male,diabetes,39,39,26
22,male,diabetes,42,39,56
9,male,cancer,48,49,44
18,male,diabetes,50,33,44
5,male,diabetes,47,40,-999
14,male,diabetes,47,41,42
3,male,diabetes,63,65,63
24,male,fever,52,62,47
8,female,diabetes,39,44,44
1,female,cancer,34,44,39
4,female,diabetes,44,50,39
2,female,diabetes,39,41,42
19,female,cancer,28,46,44
17,female,diabetes,47,57,44
6,female,diabetes,47,41,40
10,female,diabetes,47,54,53
13,female,diabetes,47,46,47
23,female,diabetes,65,65,58
25,female,"Breast cancer",47,44,42
""")

df = pandas.read_csv(datastring, na_values=-999)

# create the `has cancer` column
df['has cancer'] = df.disease.apply(lambda row: 'cancer' in row)

# print the filtered data
print(df[df['has cancer']].to_string())


    id  gender          disease  read  write  science has cancer
0   11    male  cancer,diabetes    34     46       39       True
3   16    male           cancer    47     31       36       True
8    9    male           cancer    48     49       44       True
15   1  female           cancer    34     44       39       True
18  19  female           cancer    28     46       44       True
24  25  female    Breast cancer    47     44       42       True
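
The table in the question marks the missing science score with "." while the inline data above substitutes -999. A minimal sketch, assuming the real file keeps the "." marker, passes that marker to na_values instead. The three-row sample is a shortened stand-in for the data, and the variable names are illustrative only.

```python
from io import StringIO

import pandas as pd

# Shortened stand-in for the real file; "." marks a missing score,
# as in row 11 of the question's table.
sample = StringIO("""\
id,gender,disease,read,write,science
11,male,"cancer,diabetes",34,46,39
5,male,diabetes,47,40,.
25,female,"Breast cancer",47,44,42
""")

# Treat "." as missing so `science` is parsed as a numeric column.
df = pd.read_csv(sample, na_values=".")

# Boolean mask: True where the disease text mentions "cancer".
has_cancer = df["disease"].str.contains("cancer")

cancer_rows = df[has_cancer]
missing_science = df[df["science"].isna()]

print(cancer_rows.to_string())
print(missing_science.to_string())
```

Filtering with the mask directly also avoids adding a helper column to the DataFrame, which may be preferable when the frame is written back out later.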

Answer 1: (score: 2)

This answer does exactly what you want. You just need something like df[df['A'].str.contains("hello")]:

from os.path import dirname, join

import pandas as pd

# read_csv already returns a DataFrame
delta = pd.read_csv(join(dirname(__file__), 'data.csv'))
# boolean mask: True where the disease text contains 'cancer'
query = delta['disease'].str.contains('cancer')
delta_filtered = delta[query]
print(delta_filtered)
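
The question also asks how to get the other columns of the matching rows. The following is a sketch only, assuming the column names shown in the question. The case=False and na=False arguments are optional additions that are not part of this answer: the first lets "Cancer" match too, the second treats a missing disease value as a non-match. The inline sample stands in for data.csv.

```python
from io import StringIO

import pandas as pd

# Small inline stand-in for data.csv (a few rows from the question).
sample = StringIO("""\
id,gender,disease,read,write,science
11,male,"cancer,diabetes",34,46,39
20,male,diabetes,60,52,61
16,male,cancer,47,31,36
25,female,"Breast cancer",47,44,42
""")
delta = pd.read_csv(sample)

# case=False also matches "Cancer"; na=False treats missing values as no match.
query = delta["disease"].str.contains("cancer", case=False, na=False)

# Keep the matching rows, and only the columns the question asks for.
wanted = ["id", "gender", "read", "write", "science"]
delta_filtered = delta.loc[query, wanted]

print(delta_filtered)

# Iterate over the matching rows when individual values are needed.
for row in delta_filtered.itertuples(index=False):
    print(row.id, row.gender, row.read, row.write, row.science)
```

Selecting with .loc takes the row mask and the column list in one step, so no intermediate copy of the full frame is needed.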

Answer 2: (score: 0)

This reads your CSV file, keeps only the lines that contain "cancer", and produces variables that you can use right away or store for later.

with open("input.csv") as infile:
    for line in infile:
        if "cancer" in line:  # keep lines that contain "cancer" (case-sensitive)
            line = line.rstrip("\n")  # strip the trailing newline
            # split the line on tabs and assign each field to its own variable
            pid, gender, disease, read, write, science = line.split('\t')
            print(pid, gender, disease, read, write, science)

Input:

id  gender  disease          read    write   science
11  male    cancer, diabetes 34  46  39
20  male    diabetes     60  52  61
12  male    diabetes     37  44  39
16  male    cancer           47  31  36

Output:

11 male cancer, diabetes 34 46 39
16 male cancer           47 31 36
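
Testing the whole line for "cancer" would also match the word if it ever appeared in another column. A sketch with the standard library csv module, assuming the file is tab-delimited with exactly one tab between fields as this answer expects, checks only the disease field. The inline text stands in for input.csv.

```python
import csv
from io import StringIO

# Tab-separated stand-in for input.csv (assumption: one tab between fields).
sample = StringIO(
    "id\tgender\tdisease\tread\twrite\tscience\n"
    "11\tmale\tcancer, diabetes\t34\t46\t39\n"
    "20\tmale\tdiabetes\t60\t52\t61\n"
    "16\tmale\tcancer\t47\t31\t36\n"
)

# DictReader uses the header row as keys, so fields are accessed by name.
reader = csv.DictReader(sample, delimiter="\t")

# Keep rows whose disease field mentions "cancer", ignoring case.
cancer_rows = [row for row in reader if "cancer" in row["disease"].lower()]

for row in cancer_rows:
    print(row["id"], row["gender"], row["disease"],
          row["read"], row["write"], row["science"])
```

With a real file, the StringIO object would be replaced by open("input.csv", newline=""), which is how the csv module documentation recommends opening files.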