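How do I get microarray data?

Time: 2017-04-12 11:20:56

Tags: python-3.x

Thanks for your help. I want to use the following Python code to read and process data from an Affymetrix microarray dataset. My goal is to elucidate differential gene expression between the Crohn's disease and ulcerative colitis disease states in mononuclear cells. The code runs without raising an error, but when I try to view the contents of X, I get an empty array in the output (i.e. array([], dtype=float64)), which is of course useless. Here is the link to the original dataset: https://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS1615 I have tried to work out why the output is empty and unusable, but to no avail. Here is the code: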

import gzip
import numpy as np

"""
Read in a SOFT format data file.  The following values can be exported:

GID : A list of gene identifiers of length d
SID : A list of sample identifiers of length n
STP : A list of sample descriptions of length d
X   : A dxn array of gene expression values
"""
#path to the data file
fname = "../data/GDS1615_full.soft.gz"

## Open the data file directly as a gzip file
with gzip.open(fname) as fid:
    SIF = {}
    for line in fid:
        if line.startswith(line, len("!dataset_table_begin")):
            break
        elif line.startswith(line, len("!subject_description")):
            subset_description = line.split("=")[1].strip()
        elif line.startswith(line, len("!subset_sample_id")):
            subset_ids = [x.strip() for x in subset_ids]
            for k in subset_ids:
                SIF[k] = subset_description
    ## Next line is the column headers (sample id's)
    SID = next(fid).split("\t")

    ## The column indices that contain gene expression data
    I = [i for i,x in enumerate(SID) if x.startswith("GSM")]

    ## Restrict the column headers to those that we keep
    SID = [SID[i] for i in I]

    ## Get a list of sample labels
    STP = [SIF[k] for k in SID]

    ## Read the gene expression data as a list of lists, also get the gene
    ## identifiers
    GID,X = [],[]
    for line in fid:

        ## This is what signals the end of the gene expression data
        ## section in the file
        if line.startswith("!dataset_table_end"):
            break

        V = line.split("\t")

        ## Extract the values that correspond to gene expression measures
        ## and convert the strings to numbers
        x = [float(V[i]) for i in I]

        X.append(x)
        GID.append(V[0] + ";" + V[1])
X = np.array(X)

## The indices of samples for the ulcerative colitis group
UC = [i for i,x in enumerate(STP) if x == "ulcerative colitis"]

## The indices of samples for the Crohn's disease group
CD = [i for i,x in enumerate(STP) if x == "Crohn's disease"]

In the console, I get this output:

    X
    Out[94]: array([], dtype=float64)

    X.shape
    Out[95]: (0,)

Thanks again for your advice.

1 Answer:

Answer 0 (score: 0)

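This works well. gzip.open opens the file in binary mode by default, so every line it yields is a bytes object; the version below therefore compares against bytes prefixes and decodes wherever a str is needed: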

    import gzip
    import numpy as np


    """
    Read in a SOFT format data file.  The following values can be exported:

    GID : A list of gene identifiers of length d
    SID : A list of sample identifiers of length n
    STP : A list of sample descriptions of length n
    X   : A dxn array of gene expression values
    """
    #path to the data file
    fname = "../data/GDS1615_full.soft.gz"

    ## Open the data file directly as a gzip file
    with gzip.open(fname) as fid:
        SIF = {}
        for line in fid:
            if line.startswith(b"!dataset_table_begin"):
                break
            elif line.startswith(b"!subset_description"):
                subset_description = line.decode('utf8').split("=")[1].strip()
            elif line.startswith(b"!subset_sample_id"):
                subset_ids = line.decode('utf8').split("=")[1].split(",")
                subset_ids = [x.strip() for x in subset_ids]
                for k in subset_ids:
                    SIF[k] = subset_description
        ## Next line is the column headers (sample id's)
        SID = next(fid).split(b"\t")
        ## The column indices that contain gene expression data
        I = [i for i,x in enumerate(SID) if x.startswith(b"GSM")]
        ## Restrict the column headers to those that we keep
        SID = [SID[i] for i in I]
        ## Get a list of sample labels   
        STP = [SIF[k.decode('utf8')] for k in SID]
        ## Read the gene expression data as a list of lists, also get the gene
        ## identifiers (this loop must stay inside the "with" block, because
        ## fid is closed as soon as the block ends)
        GID,X = [],[]
        for line in fid:
            ## This is what signals the end of the gene expression data
            ## section in the file
            if line.startswith(b"!dataset_table_end"):
                break
            V = line.split(b"\t")
            ## Extract the values that correspond to gene expression measures
            ## and convert the strings to numbers
            x = [float(V[i]) for i in I]
            X.append(x)
            GID.append(V[0].decode() + ";" + V[1].decode())

    X = np.array(X)
    ## The indices of samples for the ulcerative colitis group
    UC = [i for i,x in enumerate(STP) if x == "ulcerative colitis"]
    ## The indices of samples for the Crohn's disease group
    CD = [i for i,x in enumerate(STP) if x == "Crohn's disease"]

Result:

    X.shape
    Out[4]: (22283, 127)
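
As a possible alternative to decoding each line by hand, the file can be opened in text mode ("rt"), so that every line is already a str and plain string prefixes work. The following is only a minimal sketch under stated assumptions: read_soft is a hypothetical helper name, the demo file is a tiny synthetic stand-in written to a temporary directory (not the real GDS1615 data), and the expression values are returned as a plain list of lists that could be passed to np.array afterwards.

```python
import gzip
import os
import tempfile


def read_soft(path):
    """Parse a GDS-style SOFT file in text mode (hypothetical helper)."""
    SIF = {}                      # sample id -> subset description
    GID, X = [], []
    with gzip.open(path, "rt", encoding="utf8") as fid:
        subset_description = None
        for line in fid:
            if line.startswith("!dataset_table_begin"):
                break
            elif line.startswith("!subset_description"):
                subset_description = line.split("=", 1)[1].strip()
            elif line.startswith("!subset_sample_id"):
                for k in line.split("=", 1)[1].split(","):
                    SIF[k.strip()] = subset_description
        # Header row: keep only the sample (GSM) columns
        header = next(fid).rstrip("\n").split("\t")
        I = [i for i, x in enumerate(header) if x.startswith("GSM")]
        SID = [header[i] for i in I]
        STP = [SIF[k] for k in SID]
        # Data rows, read while the file is still open
        for line in fid:
            if line.startswith("!dataset_table_end"):
                break
            V = line.rstrip("\n").split("\t")
            X.append([float(V[i]) for i in I])
            GID.append(V[0] + ";" + V[1])
    return GID, SID, STP, X


# Synthetic two-gene, two-sample file, made up purely for this demo
demo = (
    "!subset_description = ulcerative colitis\n"
    "!subset_sample_id = GSM1\n"
    "!subset_description = Crohn's disease\n"
    "!subset_sample_id = GSM2\n"
    "!dataset_table_begin\n"
    "ID_REF\tIDENTIFIER\tGSM1\tGSM2\n"
    "1007_s_at\tDDR1\t1.5\t2.5\n"
    "1053_at\tRFC2\t3.0\t4.0\n"
    "!dataset_table_end\n"
)
path = os.path.join(tempfile.mkdtemp(), "demo.soft.gz")
with gzip.open(path, "wt", encoding="utf8") as out:
    out.write(demo)

# With the default mode ("rb"), lines come back as bytes, not str
with gzip.open(path) as fid:
    first = next(fid)
print(type(first).__name__)       # bytes

GID, SID, STP, X = read_soft(path)
print((len(X), len(X[0])))        # (2, 2)
```

In this sketch the row-reading loop sits inside the with block on purpose, since the file object cannot be iterated once the block has closed it. Real SOFT files may contain non-numeric entries that float() would reject; the sketch does not handle that case.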