Creating a function that handles a header row and a row-name column

Time: 2011-05-08 22:29:02

Tags: python function bioinformatics

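I am defining a function that will return a list of lists, where element zero is the 2D array, element one is the header information, and element two is the row names. How do I read this in from a file?

The file looks like this:

genes S1 S2 S3 S4 S5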

100 -0.243 -0.021 -0.205 -1.283 0.411

10000 -1.178 -0.79 0.063 -0.878 0.011

def input2DarrayData(fn):
    # define twoDarray, headerLine and rowLabels
    twoDarray = []
    # open filehandle
    fh = open(fn)
    # collect header information


    # read in the rest of the data and organize it into a list of lists
    for line in fh:
        # split line into columns and append to array
        arrayCols = line.strip().split('\t')
        # collect rowname information

        **what goes here?**


        # convenient float conversion for each element in the list using the
        # map function. note that this assumes each element is a number and can
        # be cast as a float. see floatizeData(), which gives the explicit
        # example of how the map function works conceptually.
        twoDarray.append(map(float, arrayCols))
    # return data
    return twoDarray

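I keep getting an error saying it cannot convert the first word in the file (genes) to a float, because it is a string. So my problem is figuring out how to read the first line.

2 answers:

Answer 0: (score: 1)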

def input2DarrayData(fn):
    # define twoDarray, headerLine and rowLabels
    twoDarray = []
    headerLine = None
    rowLabels = []
    # open filehandle
    fh = open(fn)

    headerLine = fh.readline()
    headerLine = headerLine.strip().split('\t')

    for line in fh:
        arrayCols = line.strip().split('\t')
        rowLabels.append(arrayCols[0])

        # list(...) keeps this a real list on Python 3, where map() is lazy
        twoDarray.append(list(map(float, arrayCols[1:])))
    # return data
    return [twoDarray, headerLine, rowLabels]

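If this works for you, read PEP-8 and refactor the variable and function names. Also, don't forget to close the file. Better yet, use a with statement, which closes it for you: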

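def input2DarrayData(fn):
    """Return [twoDarray, headerLine, rowLabels] read from a tab-delimited file."""
    twoDarray = []
    rowLabels = []
    # the with block closes the file even if an exception is raised
    with open(fn) as fh:
        # the first line is the header
        headerLine = fh.readline()
        headerLine = headerLine.strip().split('\t')
        for line in fh:
            arrayCols = line.strip().split('\t')
            # the first column is the row name; the rest are numbers
            rowLabels.append(arrayCols[0])
            twoDarray.append(list(map(float, arrayCols[1:])))
    return [twoDarray, headerLine, rowLabels]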
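
As a rough sketch of what the PEP 8 refactoring might look like, the version below uses snake_case names and a list comprehension in place of map. The names read_labeled_matrix, data, header, row_labels and the sep parameter are my own choices for illustration, not something from the question, and the code assumes a tab-delimited file with a single header line.

```python
def read_labeled_matrix(path, sep="\t"):
    """Read a delimited text file with a header row and a row-label column.

    Returns [data, header, row_labels]. All names are illustrative only.
    """
    data = []
    row_labels = []
    with open(path) as handle:
        # The first line holds the column names.
        header = handle.readline().rstrip("\n").split(sep)
        for line in handle:
            if not line.strip():
                continue  # skip blank lines
            cols = line.rstrip("\n").split(sep)
            row_labels.append(cols[0])  # keep the row name as a string
            data.append([float(value) for value in cols[1:]])
    return [data, header, row_labels]
```

Note that the header list still contains the label of the row-name column (genes in the sample file) as its first element, so it is one item longer than each data row.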

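Answer 1: (score: 1)

To handle the header line (the first line in the file), consume it explicitly with .readline() before iterating over the remaining lines: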

fh = open(fileName)
headers = fh.readline().strip().split('\t')
for line in fh:
    arrayCols = line.strip().split('\t')
    ## etc...

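I'm not sure what data structure you want to get from the file; you seem to imply that you want each row to be a list that includes the headers. Duplicating the headers like that doesn't make much sense.

Assuming a fairly simple file structure with a header row and a fixed number of columns per row, here is a generator that yields, for each row, a dictionary with the headers as keys and the column values as values: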

def process_file(filepath):
    ## open the file
    with open(filepath) as src:
        ## read the first line as headers
        headers = src.readline().strip().split('\t')
        for line in src:
            ## Split the line
            line = line.strip().split('\t')
            ## Coerce each value to a float
            line = [float(col) for col in line]
            ## Create a dictionary using headers and cols
            line_dict = dict(zip(headers, line))
            ## Yield it
            yield line_dict

>>> for row in process_file('path/to/myfile'):
...     print(row)
...
{'genes': 100.0, 'S1': -0.243, 'S2': -0.021, 'S3': -0.205, 'S4': -1.283, 'S5': 0.411}
{'genes': 10000.0, 'S1': -1.178, 'S2': -0.79, 'S3': 0.063, 'S4': -0.878, 'S5': 0.011}
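
If the row name should stay a string rather than be coerced to a float along with the data, the standard library's csv.DictReader can take care of the header handling. This is only a minimal sketch for Python 3, assuming a tab-delimited file whose first column holds the row names; iter_rows and label_column are hypothetical names introduced here.

```python
import csv


def iter_rows(path, label_column=None):
    """Yield (row_label, {column: float}) pairs from a tab-delimited file."""
    with open(path, newline="") as src:
        reader = csv.DictReader(src, delimiter="\t")
        # Assumption: unless told otherwise, the first header cell names
        # the row-label column.
        label = label_column or reader.fieldnames[0]
        for record in reader:
            row_label = record.pop(label)  # stays a string
            yield row_label, {key: float(value) for key, value in record.items()}
```

Each item pairs the row name with a dictionary of the numeric columns only, so the gene identifier is never passed through float().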