python: by keyword

Time: 2017-03-24 14:26:51

Tags: python parsing

I have this file:

GSENumber   Species  Platform  Sample  Age  Tissue   Sex       Count
GSE11097    Rat     GPL1355 GSM280267   4   Liver   Male    Count
GSE11097    Rat     GPL1355 GSM280268   4   Liver   Female  Count
GSE11097    Rat     GPL1355 GSM280269   6   Liver   Male    Count
GSE11097    Rat     GPL1355 GSM280409   6   Liver   Female  Count
GSE11291    Mouse   GPL1261 GSM284967   5   Heart   Male    Count
GSE11291    Mouse   GPL1261 GSM284968   5   Heart   Male    Count
GSE11291    Mouse   GPL1261 GSM284969   5   Heart   Male    Count
GSE11291    Mouse   GPL1261 GSM284970   5   Heart   Male    Count
GSE11291    Mouse   GPL1261 GSM284975   10  Heart   Male    Count
GSE11291    Mouse   GPL1261 GSM284976   10  Heart   Male    Count
GSE11291    Mouse   GPL1261 GSM284987   5   Muscle  Male    Count
GSE11291    Mouse   GPL1261 GSM284988   5   Muscle  Female  Count
GSE11291    Mouse   GPL1261 GSM284989   30  Muscle  Male    Count
GSE11291    Mouse   GPL1261 GSM284990   30  Muscle  Male    Count
GSE11291    Mouse   GPL1261 GSM284991   30  Muscle  Male    Count

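As you can see, there are two series here (GSE11097 and GSE11291), and I want a summary for each one. For each "GSE" number, the output should be a dictionary like this: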

Series      Species  Platform AgeRange Tissue   Sex   Count
GSE11097    Rat     GPL1355     4-6    Liver    Mixed    Count
GSE11291    Mouse   GPL1261     5-10   Heart    Male     Count
GSE11291    Mouse   GPL1261     5-30   Muscle   Mixed    Count

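I know that one way to do this is:

  1. Read in the file and make a list of all the GSE numbers.
  2. Then read in the file again and parse it according to each GSE number.

For example: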

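    import sys

    # First pass: collect the distinct series IDs, skipping the header row.
    list_of_series = list(set([line.strip().split()[0]
                               for line in open(sys.argv[1]).readlines()[1:]
                               if line.strip()]))

    list_of_dicts = []
    for each_series in list_of_series:
        # Keys are lowercase throughout so the assignments below match.
        temp_dict = {"species": "", "platform": "", "age": [], "tissue": "", "sex": [], "count": ""}
        # Second pass: the whole file is re-read once per series.
        for line in open(sys.argv[1]).readlines()[1:]:
            line = line.strip().split()
            if line and line[0] == each_series:
                temp_dict["species"] = line[1]
                temp_dict["platform"] = line[2]
                temp_dict["age"].append(line[4])
                temp_dict["tissue"] = line[5]
                temp_dict["sex"].append(line[6])
                temp_dict["count"] = line[7]
        list_of_dicts.append(temp_dict)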
    

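I think this is messy in several ways:

  1. I have to read through the whole file twice (and the real file is much bigger than this example).

  2. This method keeps overwriting the same dictionary entries with the same values.

  3. There is also a problem with sex: I want to say "Mixed" if both males and females are present, and otherwise "Male" or "Female".

I can get this code to work, but I'm wondering whether there are any quick tips for making it cleaner and more pythonic?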

2 Answers:

Answer 0: (score: 0)

I agree with Max Paymar that this should be done in a query language. If you really want to do it in Python, the pandas module will help a lot.
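A minimal sketch of one way to do this with pandas, assuming the file is whitespace-delimited with the header shown in the question. The helper names `age_range`, `sex_label` and `summarize` are made up for illustration, and the data is inlined through `io.StringIO` only so the snippet runs on its own; in practice you would pass the file path to `summarize` instead. Grouping on both series and tissue is an assumption based on the desired output, which has separate Heart and Muscle rows for GSE11291.

```python
import io

import pandas as pd

# Rows copied from the question; normally this would be a file on disk.
SAMPLE = """\
GSENumber Species Platform Sample Age Tissue Sex Count
GSE11097 Rat GPL1355 GSM280267 4 Liver Male Count
GSE11097 Rat GPL1355 GSM280268 4 Liver Female Count
GSE11097 Rat GPL1355 GSM280269 6 Liver Male Count
GSE11097 Rat GPL1355 GSM280409 6 Liver Female Count
GSE11291 Mouse GPL1261 GSM284967 5 Heart Male Count
GSE11291 Mouse GPL1261 GSM284968 5 Heart Male Count
GSE11291 Mouse GPL1261 GSM284969 5 Heart Male Count
GSE11291 Mouse GPL1261 GSM284970 5 Heart Male Count
GSE11291 Mouse GPL1261 GSM284975 10 Heart Male Count
GSE11291 Mouse GPL1261 GSM284976 10 Heart Male Count
GSE11291 Mouse GPL1261 GSM284987 5 Muscle Male Count
GSE11291 Mouse GPL1261 GSM284988 5 Muscle Female Count
GSE11291 Mouse GPL1261 GSM284989 30 Muscle Male Count
GSE11291 Mouse GPL1261 GSM284990 30 Muscle Male Count
GSE11291 Mouse GPL1261 GSM284991 30 Muscle Male Count
"""


def age_range(ages):
    """Return 'min-max' for a group's ages, or a single value if they are all equal."""
    low, high = ages.min(), ages.max()
    return str(low) if low == high else f"{low}-{high}"


def sex_label(sexes):
    """Return the single sex present in the group, or 'Mixed' if there is more than one."""
    unique = sexes.unique()
    return unique[0] if len(unique) == 1 else "Mixed"


def summarize(source):
    """Read the whitespace-delimited table once and summarize each series/tissue group."""
    df = pd.read_csv(source, sep=r"\s+")
    grouped = df.groupby(["GSENumber", "Species", "Platform", "Tissue"], sort=False)
    summary = grouped.agg(
        AgeRange=("Age", age_range),
        Sex=("Sex", sex_label),
        Count=("Count", "first"),
    ).reset_index()
    summary = summary.rename(columns={"GSENumber": "Series"})
    return summary[["Series", "Species", "Platform", "AgeRange", "Tissue", "Sex", "Count"]]


summary = summarize(io.StringIO(SAMPLE))
print(summary.to_string(index=False))
```

If you need dictionaries rather than a DataFrame, `summary.to_dict("records")` gives one dict per row.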

Using pandas gives you the result you asked for, and it is cleaner than parsing the file in pure Python.
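If pandas is not an option, the same summary can also be built in one pass with the standard library, which addresses the three concerns in the question: the file is read once, each group's fixed fields are stored once, and sex is collected into a set so that "Mixed" falls out naturally. This is only a sketch; the function name `summarize` and the choice to key groups on (series, tissue) are assumptions, and the sample is a handful of rows from the question's file.

```python
# A few rows from the question's file, inlined so the snippet is self-contained.
SAMPLE = """\
GSENumber Species Platform Sample Age Tissue Sex Count
GSE11097 Rat GPL1355 GSM280267 4 Liver Male Count
GSE11097 Rat GPL1355 GSM280409 6 Liver Female Count
GSE11291 Mouse GPL1261 GSM284967 5 Heart Male Count
GSE11291 Mouse GPL1261 GSM284975 10 Heart Male Count
GSE11291 Mouse GPL1261 GSM284988 5 Muscle Female Count
GSE11291 Mouse GPL1261 GSM284989 30 Muscle Male Count
"""


def summarize(lines):
    """Build one summary dict per (series, tissue) pair in a single pass."""
    groups = {}  # plain dicts keep insertion order on Python 3.7+
    rows = iter(lines)
    next(rows, None)  # skip the header row
    for line in rows:
        fields = line.split()
        if len(fields) != 8:
            continue  # ignore blank or malformed lines
        series, species, platform, _sample, age, tissue, sex, count = fields
        # Fixed fields are written only when the group is first seen.
        group = groups.setdefault((series, tissue), {
            "Series": series, "Species": species, "Platform": platform,
            "Tissue": tissue, "Count": count, "ages": set(), "sexes": set(),
        })
        group["ages"].add(int(age))
        group["sexes"].add(sex)

    summaries = []
    for group in groups.values():
        ages, sexes = group.pop("ages"), group.pop("sexes")
        low, high = min(ages), max(ages)
        group["AgeRange"] = str(low) if low == high else f"{low}-{high}"
        group["Sex"] = sexes.pop() if len(sexes) == 1 else "Mixed"
        summaries.append(group)
    return summaries


summaries = summarize(SAMPLE.splitlines())
for entry in summaries:
    print(entry)
```

For a real file, an open file handle can be passed directly, for example `with open(path) as handle: summarize(handle)`, since the function only iterates over lines.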

Answer 1: (score: 0)

public Object method(boolean condition1, boolean condition2)
{
    Object objects[] = { a, b, c, d }; // Assuming objects a, b, c and d exist...

    /*
     * Truth Table
     *
     *    condition1  condition2   Object
     *      false      false        d
     *      false      true         c
     *      true       false        b
     *      true       true         a
     */

    // condition1 contributes 0 or 2 and condition2 contributes 0 or 1,
    // so the resulting index matches the truth table above.
    int selector = (condition1 ? 0 : 2) + (condition2 ? 0 : 1);

    return objects[selector];
}