python: for loop over a list of lists, or change the data into a 2D array

Date: 2015-08-05 20:52:46

Tags: python arrays list

I'm having some trouble with Python. I have a tab-delimited text file that looks like this:

SNP Name    ss715583617 ss715592335 ss715591044 ss7155(98181
Chromosome  Gm02    Gm05    Gm05    Gm07
Position    5581696 6943050 34695858    43520803
Cultivar Name               
PI065549    T   A   A   T
PI081762    T   A   A   T
PI101404A   T   A   A   T
PI101404B   T   A   A   T

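I need to read this file: put the SNP names into one array and the chromosomes into another, skip the Position line, and skip the Cultivar Name line. The data starting at PI065549 T A A T should then go into a 2D array. As I understand it, in Python that means a list of lists. My questions are: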

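  1. Does my code actually put the data into a list of lists?
  2. How do I iterate over the list of lists with index positions to analyze the data? For my analysis it is important to be able to work column-wise.
  3. My main goal is to analyze the data column by column so that I can convert the characters to integers based on certain conditions.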

    My code so far:

    snpNames = [] #to hold snp names(column titles) 
    chrm = [] #to hold chr, needed later for random sampling 
    numLines = 0; # to determine how many lines in the files, needed to     determine size of 2 d array 
    
    snps = [[]] ## store the snp data in a list of lists...
    
    with open("/home/dfreese/Desktop/testSNPtext") as file: 
    
        #read in the first line these contain the names and store into an array 
        firstLine = file.readline().strip()
        for i in firstLine.split("\t"): 
            snpNames.append(i)
    
        #second line contains the chr data, read that into the chr array
        secondLine = file.readline().strip()
        for i in secondLine.split("\t"):
            chrm.append(i)
    
    
         ## read in the remaining lines and fill in the 2 d array    
        for line in lines: 
            snps.append(line.strip().split("\t"))
    
    file.close()
    
    
    #check that the data is ok 
    for i in snps: 
        print (i)
    

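Any help would be greatly appreciated. I'm used to coding in C++ and Java, but Python was requested for this data analysis and I'm a bit stuck. Any suggestions or improvements are welcome.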

3 answers:

Answer 0: (score: 0)

  

Does my code actually put the data into a list of lists?

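Yes, it does. However, you don't need snps = [[]]; snps = [] is enough, because [[]] starts the list with one empty inner list already in it.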
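A minimal sketch of the difference, using made-up rows rather than the real file: starting from [[]] leaves an empty list at index 0, which can trip up later code that assumes every row has the same length.

```python
# Hypothetical rows standing in for the split lines of the file.
rows = [["PI065549", "T", "A", "A", "T"], ["PI081762", "T", "A", "A", "T"]]

with_empty_row = [[]]   # already contains one (empty) inner list
plain = []              # contains nothing yet

for row in rows:
    with_empty_row.append(row)
    plain.append(row)

print(len(with_empty_row))  # the empty list plus the two data rows
print(with_empty_row[0])    # the leftover empty inner list
print(len(plain))           # only the two data rows
```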

  

How do I iterate over the list of lists with index positions to analyze the data?

for i, item in enumerate(snps):
    print(i, item)

Output:    
0 ['T', 'A', 'A', 'T']
1 ['T', 'A', 'A', 'T']
2 ['T', 'A', 'A', 'T']
...
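Since the question stresses column-wise access, here is a short sketch of two ways to get at columns. The data is hypothetical (a few calls were changed so the columns differ), and it assumes every row has the same length; zip stops at the shortest row otherwise.

```python
# Hypothetical data: each inner list is one cultivar's calls, as in the output above.
snps = [
    ["T", "A", "A", "T"],
    ["T", "A", "G", "T"],
    ["C", "A", "A", "T"],
]

# Row and column indices together, similar to a nested for loop in C++ or Java.
for row_idx, row in enumerate(snps):
    for col_idx, call in enumerate(row):
        print(row_idx, col_idx, call)

# zip(*snps) transposes the rows, so each tuple holds one whole column.
columns = list(zip(*snps))
for col_idx, column in enumerate(columns):
    print(col_idx, column)
```

A single cell is still reachable as snps[row_idx][col_idx]; the transposed form is mainly convenient when each column is processed as a unit.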

If the first item of each row (the Cultivar Name column) is unique, you could consider using a dictionary ({}) instead:

snps = {}

...

for line in lines:
    row_vals = line.strip().split('\t')
    snps[row_vals[0]] = row_vals[1:]

Items in the dictionary can now be accessed like this:

print(snps['PI065549'])

Output:
['T', 'A', 'A', 'T']
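Column-wise access still works with the dictionary layout. The sketch below uses hypothetical data and a hypothetical helper named column; it simply picks the same position out of every cultivar's list.

```python
# Hypothetical dictionary in the shape built above: cultivar name -> list of calls.
snps = {
    "PI065549": ["T", "A", "A", "T"],
    "PI081762": ["T", "A", "G", "T"],
    "PI101404A": ["C", "A", "A", "T"],
}

def column(snp_dict, col_idx):
    """Collect the call at position col_idx for every cultivar."""
    return {name: calls[col_idx] for name, calls in snp_dict.items()}

third_column = column(snps, 2)
print(third_column)
```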

If you plan to do any kind of analysis, consider using pandas. There is a quick intro that also covers importing text files.

Answer 1: (score: 0)

You're almost there (this is Python 3.4, I hope that's OK):

from pathlib import Path

DATA_PATH = Path(__file__).parent / '../data/chromosome.txt'

snpNames = [] #to hold snp names(column titles) 
chrm = [] #to hold chr, needed later for random sampling 
numLines = 0; # to determine how many lines in the files, needed to
              # determine size of 2 d array 

snps_dict = {}  # store the snp data keyed by cultivar name...
snps = []       # ...and also as a list of lists

with DATA_PATH.open('r') as file:
     #read in the first line these contain the names and store into an array 
    firstLine = file.readline().strip()
    for i in firstLine.split(): 
        snpNames.append(i)

    #second line contains the chr data, read that into the chr array
    secondLine = file.readline().strip()
    for i in secondLine.split():
        chrm.append(i)

     ## read in the remaining lines and fill in the 2 d array    
    for line in file.readlines():
        if line.startswith('Position'):
            continue
        elif line.startswith('Cultivar Name'):
            continue
        splt = line.split()
        snps_dict[splt[0]] = splt[1:]  # key each row by its cultivar name
        snps.append(splt)

print(snpNames)
print(chrm)
print(snps_dict)
print(snps)

which produces:

['SNP', 'Name', 'ss715583617', 'ss715592335', 'ss715591044', 
 'ss7155(98181']
['Chromosome', 'Gm02', 'Gm05', 'Gm05', 'Gm07']
{'PI065549': ['T', 'A', 'A', 'T'], 'PI101404A': ['T', 'A', 'A', 'T'], 
 'PI081762': ['T', 'A', 'A', 'T'], 'PI101404B': ['T', 'A', 'A', 'T']}
[['PI065549', 'T', 'A', 'A', 'T'], ['PI081762', 'T', 'A', 'A', 'T'], 
 ['PI101404A', 'T', 'A', 'A', 'T'], ['PI101404B', 'T', 'A', 'A', 'T']]
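The question's third goal, converting characters to integers column by column, could be sketched on top of this list of lists as follows. The encoding rule (0 for a column's most common call, 1 for anything else) is purely an assumption for illustration, since the question does not state its conditions, and the rows are hypothetical.

```python
from collections import Counter

# Hypothetical rows in the same shape as the snps list printed above.
snps = [
    ["PI065549", "T", "A", "A", "T"],
    ["PI081762", "T", "A", "G", "T"],
    ["PI101404A", "C", "A", "A", "T"],
]

names = [row[0] for row in snps]
calls = [row[1:] for row in snps]

def encode_column(column):
    """Assumed rule: 0 for the column's most common call, 1 for anything else."""
    most_common_call = Counter(column).most_common(1)[0][0]
    return [0 if call == most_common_call else 1 for call in column]

# Transpose to columns, encode each column, then transpose back to rows.
encoded_columns = [encode_column(col) for col in zip(*calls)]
encoded_rows = [list(row) for row in zip(*encoded_columns)]

for name, row in zip(names, encoded_rows):
    print(name, row)
```

Swapping in a different condition only requires changing encode_column; the transpose, encode, transpose-back pattern stays the same.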

Answer 2: (score: 0)

If you have a tab-delimited file, you can use pandas to create a DataFrame:

import pandas as pd

df = pd.read_csv("/home/dfreese/Desktop/testSNPtext", delimiter="\t", header=None, names=["a", "b", "c", "d", "e"])

which should give you a DataFrame like this:

                   a            b            c            d             e
0       SNP Name  ss715583617  ss715592335  ss715591044  ss7155(98181
1     Chromosome         Gm02         Gm05         Gm05          Gm07
2       Position      5581696      6943050     34695858      43520803
3  Cultivar Name          NaN          NaN          NaN           NaN
4       PI065549            A            A            T           NaN
5       PI081762            A            A            T           NaN
6      PI101404A            A            A            T           NaN
7      PI101404B            A            A            T           NaN

To select the last four rows together:

 print(df.iloc[4:8])
           a  b  c  d    e
4   PI065549  A  A  T  NaN
5   PI081762  A  A  T  NaN
6  PI101404A  A  A  T  NaN
7  PI101404B  A  A  T  NaN
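Once the genotype rows are in a DataFrame, column-wise conversion to integers can be sketched with DataFrame.apply, which processes one column at a time by default. The table below is built by hand from hypothetical values instead of being read from the file, and the rule (0 for the most common call in a column, 1 otherwise) is an assumption, not something the question specifies.

```python
import pandas as pd

# Hypothetical genotype table: cultivars as the index, SNP names as the columns.
genotypes = pd.DataFrame(
    [["T", "A", "A", "T"],
     ["T", "A", "G", "T"],
     ["C", "A", "A", "T"]],
    index=["PI065549", "PI081762", "PI101404A"],
    columns=["ss715583617", "ss715592335", "ss715591044", "ss7155(98181"],
)

def encode_column(column):
    """Assumed rule: 0 where the value is the column's most common call, else 1."""
    most_common = column.value_counts().idxmax()
    return (column != most_common).astype(int)

# apply() passes each column to encode_column as a Series.
encoded = genotypes.apply(encode_column)
print(encoded)
```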