How do I replace a column in Python given a specific condition?

Time: 2016-01-07 23:29:46

Tags: python dataframe

My data frame contains multiple columns, as shown below...

Chr1    Cufflinks   exon    28354206    28354551    .   .   .   gene_id "XLOC_008369"; transcript_id "TCONS_00014347"; exon_number "1"; oId "CUFF.2405.1"; class_code "u"; tss_id "TSS10073";
Chr1    Cufflinks   exon    28785549    28786194    .   .   .   gene_id "XLOC_008370"; transcript_id "TCONS_00014348"; exon_number "1"; oId "CUFF.2441.1"; class_code "u"; tss_id "TSS10074";
Chr1    Cufflinks   exon    29328712    29329210    .   .   .   gene_id "XLOC_008371"; transcript_id "TCONS_00014349"; exon_number "1"; oId "CUFF.2495.1"; class_code "u"; tss_id "TSS10075";
Chr1    Cufflinks   exon    29427951    29428406    .   .   .   gene_id "XLOC_008372"; transcript_id "TCONS_00014350"; exon_number "1"; oId "CUFF.2506.1"; class_code "u"; tss_id "TSS10076";
Chr1    Cufflinks   exon    29460116    29460585    .   .   .   gene_id "XLOC_008373"; transcript_id "TCONS_00014351"; exon_number "1"; oId "CUFF.2509.1"; class_code "u"; tss_id "TSS10077";

What I want to do is this: if any item in my list appears in one of the columns of the data frame, I change the second column from Cufflinks to lincRNA.

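One problem is that the column I use for the dictionary keys has the same value on multiple rows of the data frame. A dictionary keeps only unique keys, so the total number of rows in the output differs from the input.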
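To make the problem concrete, here is a minimal sketch with made-up rows (the IDs `TCONS_1` and `TCONS_2` are hypothetical, not taken from the real file). It shows how keying rows by ID drops duplicates, and how keeping the rows in a list and testing membership against a set preserves every row.

```python
# Three rows, two of which share the same ID (hypothetical values).
rows = [
    ["Chr1", "Cufflinks", "exon", "TCONS_1"],
    ["Chr1", "Cufflinks", "exon", "TCONS_1"],  # second exon of the same transcript
    ["Chr1", "Cufflinks", "exon", "TCONS_2"],
]

# Keyed by ID: a later row with the same ID overwrites the earlier one.
by_id = {}
for row in rows:
    by_id[row[3]] = row

# Alternative: keep every row and only look the ID up in a set.
wanted = {"TCONS_1"}
updated = []
for row in rows:
    new_row = list(row)  # copy so the input rows stay untouched
    if new_row[3] in wanted:
        new_row[1] = "lincRNA"
    updated.append(new_row)

print(len(rows), len(by_id), len(updated))  # prints: 3 2 3
```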

Here is my code so far...

#!/usr/bin/env python

file_in = open("lincRNA_final_transcripts.fa")
file_in2 = open("AthalianaslutteandluiN30merged.gtf")
file_out = open("updated.gtf", 'w')

sites = []
result = {}

for line in file_in:
    line = line.strip()
    if line.startswith(">"):
        line = line[1:]
        gene = str.split(line, ".")
        gene = gene[0]
        sites.append(gene)


for line2 in file_in2:
    line2 = line2.strip().split()
    line3 = str.split(line2[11], ";")
    line3 = line3[0]
    line3 = line3[1:-1]
    result[line3] = line2


for id in sites:
    id2 = str(id)
    if id2 in result.keys():
        result[id][1] = "lincRNA"

for val in result.values():
    file_out.write("\t".join(val))
    file_out.write("\n")

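1 answer:

Answer 0 (score: 2)

I will try to explain in detail how to do this in pandas. Pandas is a Python library for working with data frames, and learning it makes data frame operations easy.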

  1. Install pandas

    sudo pip install pandas
    
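  2. Load the data into a pandas DataFrame object. The GTF appears to be a tab-delimited file, so pass \t as the separator. For header, pass None if there is no header row, or 0 if the first line is a header. For more on the parameters, see the pandas read_csv documentation.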

    import pandas as pd
    df = pd.read_csv('AthalianaslutteandluiN30merged.gtf', sep = '\t', header = None, engine = 'python')
    
        0      1             2       3       4     5 6 7            8  
    0   Chr1    Cufflinks   exon 28354206 28354551 . . .    gene_id "XLOC_008369"   transcript_id "TCONS_00014347"  exon_number "1" oId "CUFF.2405.1"   class_code "u"  tss_id "TSS10073"
    1   Chr1    Cufflinks   exon 28785549 28786194 . . .    gene_id "XLOC_008370"   transcript_id "TCONS_00014348"  exon_number "1" oId "CUFF.2441.1"   class_code "u"  tss_id "TSS10074"
    
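  3. Check whether the string in column 8 contains a substring that is also in the sites list. The idea is to join the list into a single pattern with | and pass it to str.contains.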

    sites = ["XLOC_008369", "XLOC_008369"]
    pattern = '|'.join(sites)
    mask = df[8].str.contains(pattern)
    
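  4. If column 8 contains a substring that matches an element of the sites list, use boolean indexing to change Cufflinks to lincRNA. For more on pandas indexing, see the pandas indexing documentation.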

    df.loc[mask,1] = 'lincRNA'
    
  5. Edit: use str.contains to check whether the pandas column contains any element of the list.
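The steps above can be put together as a minimal end-to-end sketch. It rests on several assumptions: the two sample rows are held in memory instead of being read from the real file, the sites list is made up, and the IDs to match are taken to be gene_id values. It also swaps the substring test for an exact match (str.extract followed by isin), so that a short ID cannot accidentally match a longer one. Passing quoting=csv.QUOTE_NONE is meant to leave the double quotes in the attribute column untouched when reading and writing.

```python
import csv
import io

import pandas as pd

# Two sample GTF rows (tab-separated, 9 fields); attributes live in column 8.
gtf_text = (
    'Chr1\tCufflinks\texon\t28354206\t28354551\t.\t.\t.\t'
    'gene_id "XLOC_008369"; transcript_id "TCONS_00014347"; exon_number "1";\n'
    'Chr1\tCufflinks\texon\t28785549\t28786194\t.\t.\t.\t'
    'gene_id "XLOC_008370"; transcript_id "TCONS_00014348"; exon_number "1";\n'
)

# QUOTE_NONE: treat double quotes as ordinary characters, not field quoting.
df = pd.read_csv(io.StringIO(gtf_text), sep="\t", header=None,
                 quoting=csv.QUOTE_NONE)

# Hypothetical list of IDs that should be relabelled.
sites = ["XLOC_008369"]

# Pull the gene_id value out of the attribute string, then match it exactly.
gene_id = df[8].str.extract(r'gene_id "([^"]+)"', expand=False)
mask = gene_id.isin(sites)

# Boolean indexing: only rows where mask is True get the new label.
df.loc[mask, 1] = "lincRNA"

# Write back as tab-separated text with no header, index, or added quoting.
out = io.StringIO()
df.to_csv(out, sep="\t", header=False, index=False, quoting=csv.QUOTE_NONE)
print(out.getvalue())
```

Because the rows are never used as dictionary keys, the output has as many rows as the input. To write to disk, replace the StringIO object with a file name such as 'updated.gtf'. If the IDs in the list are transcript IDs, change gene_id to transcript_id in the regular expression.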