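I have a pandas DataFrame containing string values that I want to count. The strings I want to count are "SYNONYMOUS_CODING" and "NON_SYNONYMOUS_CODING". I found that these strings occur in columns 23, 24, 25, 29 and 31.
Column 23 looks like this: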
onPostExecute()
Column 24 looks like this:
15392 OAnc=C
15393 114
15394 EFF=NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|Gc...
15395 0/0:30:90.29:0
15396 pSC=0.441
15397 pSC=0.030
15398 bSC=884
...
Column 25 looks like this:
3092 EXON(MODIFIER||||870|RSPH10B|protein_coding|CO...
3093 NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|aCg/aT...
3094 INTERGENIC(MODIFIER||||||||||1)
3095 INTERGENIC(MODIFIER||||||||||1)
3096 DOWNSTREAM(MODIFIER||489|||PMS2||CODING|NR_003...
3097 DOWNSTREAM(MODIFIER||408|||PMS2||CODING|NR_003...
3098 DP=12
...
Column 29 looks like this:
13062 C
13063 C
13064 EFF=SYNONYMOUS(MODIFIER|||||DKFZp434L192||CODING...
13065 EFF=SYNONYMOUS(MODIFIER|||||DKFZp434L192||CODING...
13066 CAnc=G
13067 C
13068 G
And column 31 looks like this:
15688 0:0
15689 0:0
15690 NaN
15691 EFF=SYNONYMOUS_CODING(LOW|SILENT|tcC/tcG|S782|...
15692 0:0
15693 NaN
15694 0:1
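I would like to know how to go through these five columns and count the number of rows in which the string "SYNONYMOUS_CODING" or "NON_SYNONYMOUS_CODING" appears, without double counting, because there may be rows where these strings appear in two or more different columns.
Thanks.
Rodrigo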
Answer 0 (score: 1)
Here is what I came up with; I have included the code that creates the DataFrame. You can follow the algorithm by looking at the main() function.

import pandas as pd


def create_df():
    # Small example frame that mimics the annotation columns in the question.
    grid = {
        'A': ["EXON(MODIFIER||||870|RSPH10B|protein_coding|CO)",
              "NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|aCg/aT)",
              "INTERGENIC(MODIFIER||||||||||1)",
              "DOWNSTREAM(MODIFIER||489|||PMS2||CODING|NR_003)",
              "DOWNSTREAM(MODIFIER||408|||PMS2||CODING|NR_003)"],
        'B': ["FOO",
              "EFF=NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|Gc",
              "NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|aCg/aT)",
              "pSC=0.441",
              "bSC=884"],
        'C': ["BAR",
              "BAR",
              "EFF=SYNONYMOUS(MODIFIER|||||DKFZp434L192||CODING",
              "EFF=SYNONYMOUS(MODIFIER|||||DKFZp434L192||CODING",
              "EFF=SYNONYMOUS_CODING(LOW|SILENT|tcC/tcG|S782|"],
        'D': ["EFF=SYNONYMOUS_CODING(LOW|SILENT|tcC/tcG|S782|",
              "0:0",
              "0:0",
              "EFF=SYNONYMOUS_CODING(LOW|SILENT|tcC/tcG|S782|",
              "EFF=SYNONYMOUS_CODING(LOW|SILENT|tcC/tcG|S782|"],
    }
    return pd.DataFrame(grid)


def get_masks(df):
    # Boolean frames marking which cells contain each annotation.
    non_syn = pd.DataFrame(False, index=df.index, columns=df.columns)
    synonymous = pd.DataFrame(False, index=df.index, columns=df.columns)
    for i in df:
        # na=False treats missing values as "no match".
        non_syn[i] = df[i].str.contains("NON_SYNONYMOUS_CODING", regex=False, na=False)
        # "SYNONYMOUS_CODING" is a substring of "NON_SYNONYMOUS_CODING",
        # so exclude the cells already flagged as non-synonymous.
        synonymous[i] = (
            df[i].str.contains("SYNONYMOUS_CODING", regex=False, na=False) & ~non_syn[i]
        )
    return non_syn, synonymous


def count_unique_truths(mask):
    # Count each row at most once, however many of its columns match.
    return int(mask.any(axis=1).sum())


def main():
    df = create_df()
    non_syn, synonymous = get_masks(df)
    non_syn_count = count_unique_truths(non_syn)
    synonymous_count = count_unique_truths(synonymous)
    print(df)
    print("Synonymous Count = {:d}\nNon_Synonymous Count = {:d}".format(
        synonymous_count, non_syn_count))


if __name__ == '__main__':
    main()
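The same row-wise counting can also be sketched directly against the five columns from the question. This is only a minimal sketch under some assumptions: the function name count_rows and the small df_test frame below are hypothetical stand-ins (the real data is not shown in full), and a cell is treated as synonymous only if "SYNONYMOUS_CODING" still appears after every "NON_SYNONYMOUS_CODING" token has been stripped out.

```python
import numpy as np
import pandas as pd

NON_SYN = "NON_SYNONYMOUS_CODING"
SYN = "SYNONYMOUS_CODING"


def count_rows(df, columns):
    """Return (synonymous, non_synonymous, either) row counts over `columns`."""
    # astype(str) makes numeric cells searchable as text; na=False below
    # makes any missing value count as "no match".
    text = df[columns].astype(str)
    has_non = text.apply(
        lambda col: col.str.contains(NON_SYN, regex=False, na=False)
    )
    # Strip the longer token first so it is not also counted as SYNONYMOUS_CODING.
    has_syn = text.apply(
        lambda col: col.str.replace(NON_SYN, "", regex=False)
                       .str.contains(SYN, regex=False, na=False)
    )
    # any(axis=1) collapses the columns, so each row is counted at most once.
    syn_rows = has_syn.any(axis=1)
    non_rows = has_non.any(axis=1)
    return int(syn_rows.sum()), int(non_rows.sum()), int((syn_rows | non_rows).sum())


# Hypothetical stand-in for the real data, using the question's column labels.
df_test = pd.DataFrame({
    23: ["C", "EFF=SYNONYMOUS_CODING(LOW|SILENT|", np.nan, "0:0"],
    24: ["EFF=NON_SYNONYMOUS_CODING(MODERATE|", "SYNONYMOUS_CODING(LOW|", "114", np.nan],
    25: ["NON_SYNONYMOUS_CODING(MODERATE|", "DP=12", "INTERGENIC(MODIFIER)", "pSC=0.441"],
    29: ["C", "G", "C", "G"],
    31: ["0:0", np.nan, "0:1", "EFF=SYNONYMOUS_CODING(LOW|"],
})

syn, non_syn, either = count_rows(df_test, [23, 24, 25, 29, 31])
print(syn, non_syn, either)
```

The third number is the count the question asks for: rows in which either string appears in at least one of the five columns, each row counted once.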
Answer 1 (score: 0)
I can find out how often the strings "SYNONYMOUS_CODING" and "NON_SYNONYMOUS_CODING" show up across the columns in the following way:
column23 = str(df_test[23])
column24 = str(df_test[24])
column25 = str(df_test[25])
column29 = str(df_test[29])
column31 = str(df_test[31])
count = 0
if "SYNONYMOUS_CODING" in column23:
    print("YES Syn in Column 23")
    count += 1
    print("Count value:")
    print(count)
if "SYNONYMOUS_CODING" in column24:
    print("YES Syn in Column 24")
    count += 1
    print("Count value:")
    print(count)
if "SYNONYMOUS_CODING" in column25:
    print("YES Syn in Column 25")
    count += 1
    print("Count value:")
    print(count)
if "SYNONYMOUS_CODING" in column29:
    print("YES Syn in Column 29")
    count += 1
    print("Count value:")
    print(count)
if "SYNONYMOUS_CODING" in column31:
    print("YES Syn in Column 31")
    count += 1
    print("Count value:")
    print(count)
if "NON_SYNONYMOUS_CODING" in column23:
    print("YES Non_Syn in Column 23")
    count += 1
    print("Count value:")
    print(count)
if "NON_SYNONYMOUS_CODING" in column24:
    print("YES Non_Syn in Column 24")
    count += 1
    print("Count value:")
    print(count)
if "NON_SYNONYMOUS_CODING" in column25:
    print("YES Non_Syn in Column 25")
    count += 1
    print("Count value:")
    print(count)
if "NON_SYNONYMOUS_CODING" in column29:
    print("YES Non_Syn in Column 29")
    count += 1
    print("Count value:")
    print(count)
if "NON_SYNONYMOUS_CODING" in column31:
    print("YES Non_Syn in Column 31")
    count += 1
    print("Count value:")
    print(count)
But this is highly repetitive and not as Pythonic as I would like...
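One way to remove the repetition is to loop over the column labels and let pandas do the substring test cell by cell. The sketch below is only an illustration: per_column_counts and the tiny df_test frame are hypothetical names and data, not part of the original post. It also sidesteps two likely pitfalls of the approach above. str(df_test[23]) is the printed representation of the Series, which pandas may truncate for long columns or long strings, so a membership test on it can miss matches. And "SYNONYMOUS_CODING" in text is also true for text that only contains "NON_SYNONYMOUS_CODING".

```python
import pandas as pd


def per_column_counts(df, columns):
    """Count matching cells per column, keeping the two annotations apart."""
    rows = []
    for col in columns:
        # Work on the actual cell values, not on the printed Series.
        cells = df[col].astype(str)
        non_syn = cells.str.contains("NON_SYNONYMOUS_CODING", regex=False, na=False)
        # Exclude cells already flagged as non-synonymous, since the shorter
        # string is a substring of the longer one.
        syn = cells.str.contains("SYNONYMOUS_CODING", regex=False, na=False) & ~non_syn
        rows.append({
            "column": col,
            "synonymous": int(syn.sum()),
            "non_synonymous": int(non_syn.sum()),
        })
    return pd.DataFrame(rows).set_index("column")


# Hypothetical two-column stand-in for the real data.
df_test = pd.DataFrame({
    23: ["EFF=SYNONYMOUS_CODING(LOW|", "C", "EFF=NON_SYNONYMOUS_CODING(MODERATE|"],
    24: ["0:0", "SYNONYMOUS_CODING(LOW|", "SYNONYMOUS_CODING(LOW|"],
})

counts = per_column_counts(df_test, [23, 24])
print(counts)
```

These are per-column cell counts, so a row that matches in two columns is still counted twice across the table; collapsing the masks with any(axis=1) before summing would avoid that.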