在Pandas DataFrame中查找和计算字符串值

时间:2015-05-14 16:39:29

标签: python string pandas

我有一个pandas数据框,其中包含我想要计算的字符串值。我想要计算的字符串是" SYNONYMOUS_CODING"和" NON_SYNONYMOUS_CODING"。我发现这些字符串位于第23,24,25,29和31列。

第23列看起来像这样:

onPostExecute()

第24栏看起来像这样:

15392                                               OAnc=C
15393                                                  114
15394    EFF=NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|Gc...
15395                                       0/0:30:90.29:0
15396                                            pSC=0.441
15397                                            pSC=0.030
15398                                              bSC=884
...

第25栏看起来像:

3092    EXON(MODIFIER||||870|RSPH10B|protein_coding|CO...
3093    NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|aCg/aT...
3094                      INTERGENIC(MODIFIER||||||||||1)
3095                      INTERGENIC(MODIFIER||||||||||1)
3096    DOWNSTREAM(MODIFIER||489|||PMS2||CODING|NR_003...
3097    DOWNSTREAM(MODIFIER||408|||PMS2||CODING|NR_003...
3098                                                DP=12
...

第29栏看起来像:

13062                                                      C
13063                                                      C
13064    EFF=SYNONYMOUS(MODIFIER|||||DKFZp434L192||CODING...
13065    EFF=SYNONYMOUS(MODIFIER|||||DKFZp434L192||CODING...
13066                                                 CAnc=G
13067                                                      C
13068                                                      G

和第31列看起来像:

15688                                                  0:0
15689                                                  0:0
15690                                                  NaN
15691    EFF=SYNONYMOUS_CODING(LOW|SILENT|tcC/tcG|S782|...
15692                                                  0:0
15693                                                  NaN
15694                                                  0:1

我想知道如何通过五列并计算字符串的次数" SYNONYMOUS_CODING"或" NON_SYNONYMOUS_CODING"出现没有重复计算。因为可能存在这些字符串出现在两个或更多不同列中的行。

谢谢。

罗德里戈

2 个答案:

答案 0 :(得分:1)

这是我经历过的事情,我包括用于创建数据帧的代码。您可以通过关注main()方法

来查看算法
def create_df():
    grid = (
        {'A': ["EXON(MODIFIER||||870|RSPH10B|protein_coding|CO)",
               "NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|aCg/aT)",
               "INTERGENIC(MODIFIER||||||||||1)",
               "DOWNSTREAM(MODIFIER||489|||PMS2||CODING|NR_003)",
               "DOWNSTREAM(MODIFIER||408|||PMS2||CODING|NR_003)"],
         'B': ["FOO",
               "EFF=NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|Gc",
               "NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|aCg/aT)",
               "pSC=0.441",
               "bSC=884"],
         'C': ["BAR",
               "BAR",
               "EFF=SYNONYMOUS(MODIFIER|||||DKFZp434L192||CODING",
               "EFF=SYNONYMOUS(MODIFIER|||||DKFZp434L192||CODING",
               "EFF=SYNONYMOUS_CODING(LOW|SILENT|tcC/tcG|S782|"],
         'D': ["EFF=SYNONYMOUS_CODING(LOW|SILENT|tcC/tcG|S782|",
               "0:0",
               "0:0",
               "EFF=SYNONYMOUS_CODING(LOW|SILENT|tcC/tcG|S782|",
               "EFF=SYNONYMOUS_CODING(LOW|SILENT|tcC/tcG|S782|"],
        }
    )
    return pd.DataFrame(grid)

def get_masks(df):
    non_syn = pd.DataFrame(index=df.index, columns=df.columns)
    synonymous = pd.DataFrame(index=df.index, columns=df.columns)

    for i in df:
        non_syn[i] = df[i].str.contains("NON_SYNONYMOUS_CODING")
        synonymous[i] = df[i][~non_syn[i]].str.contains("SYNONYMOUS_CODING")

    return non_syn, synonymous.dropna()

def count_unique_truths(df):
    # make unique across rows, and then restore to regular
    df = df.transpose().drop_duplicates().transpose()
    return np.sum(df).sum()

def main():
    df = create_df()
    non_syn, synonymous = get_masks(df)
    non_syn_count = count_unique_truths(non_syn)
    synonymous_count = count_unique_truths(synonymous)
    print(df)
    print("Synonymous Count = {:d}\nNon_Synonymous Count = {:d}".format(int(synonymous_count), int(non_syn_count)))
    df.groupby()

if __name__ == '__main__':
    main()

答案 1 :(得分:0)

我可以得到字符串的次数," SYNONOMOUS_CODING"和" NON_SYNONOMOUS_CODING"通过以下方式显示在每列中:

column23 = str(df_test[23])
column24 = str(df_test[24])
column25 = str(df_test[25])
column29 = str(df_test[29])
column31 = str(df_test[31])

count = 0

if "SYNONYMOUS_CODING" in column23:
    print "YES Syn in Column 23"
    count += 1

    print "Count value:"
    print count

if "SYNONYMOUS_CODING" in column24:
    print "YES Syn in Column 24"
    count += 1

    print "Count value:"
    print count

if "SYNONYMOUS_CODING" in column25:
    print "YES Syn in Column 25"
    count += 1

    print "Count value:"
    print count

if "SYNONYMOUS_CODING" in column29:
    print "YES Syn in Column 29"
    count += 1

    print "Count value:"
    print count

if "SYNONYMOUS_CODING" in column31:
    print "YES Syn in Column 31"
    count += 1

    print "Count value:"
    print count

if "NON_SYNONYMOUS_CODING" in column23:
    print "YES Non_Syn in Column 23"
    count += 1

    print "Count value:"
    print count

if "NON_SYNONYMOUS_CODING" in column24:
    print "YES Non_Syn in Column 24"
    count += 1

    print "Count value:"
    print count

if "NON_SYNONYMOUS_CODING" in column25:
    print "YES Non_Syn in Column 25"
    count += 1

    print "Count value:"
    print count

if "NON_SYNONYMOUS_CODING" in column29:
    print "YES Non_Syn in Column 29"
    count += 1

    print "Count value:"
    print count

if "NON_SYNONYMOUS_CODING" in column31:
    print "YES Non_Syn in Column 31"
    count += 1

    print "Count value:"
    print count

但这是高度重复和非pythonic,就像我想要的那样......