Question

I'm having some trouble with Python. I have a tab-delimited text file that looks like this:

SNP Name    ss715583617 ss715592335 ss715591044 ss7155(98181
Chromosome  Gm02    Gm05    Gm05    Gm07
Position    5581696 6943050 34695858    43520803
Cultivar Name               
PI065549    T   A   A   T
PI081762    T   A   A   T
PI101404A   T   A   A   T
PI101404B   T   A   A   T

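I need to read this file in: put the SNP names into one array and the chromosomes into another, skip the positions, and skip the cultivar name line. Then the data starting at PI065549 T A A T should go into a 2-D array. My understanding is that in Python this means storing the data in a list of lists. My questions are:

From my code, am I putting the data into a list of lists?
How do I iterate over a list of lists with index positions to analyze the data? For my analysis it is important to be able to work on it column-wise.

My main goal is to analyze the data column by column, converting the characters to integers based on certain conditions.

My code so far is: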

snpNames = [] #to hold snp names(column titles) 
chrm = [] #to hold chr, needed later for random sampling 
numLines = 0; # to determine how many lines in the files, needed to determine size of 2 d array 

snps = [[]] ## store the snp data in a list of lists...

with open("/home/dfreese/Desktop/testSNPtext") as file: 

    #read in the first line these contain the names and store into an array 
    firstLine = file.readline().strip()
    for i in firstLine.split("\t"): 
        snpNames.append(i)

    #second line contains the chr data, read that into the chr array
    secondLine = file.readline().strip()
    for i in secondLine.split("\t"):
        chrm.append(i)


     ## read in the remaining lines and fill in the 2 d array    
    for line in lines: 
        snps.append(line.strip().split("\t"))

file.close()


#check that the data is ok 
for i in snps: 
    print (i)

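Any help would be greatly appreciated. I'm used to coding in C++ and Java, but Python was requested for this data analysis and I'm a bit stuck. Any suggestions or improvements are welcome.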

Answer 1

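From my code, am I putting the data into a list of lists?

Yes, you are. However, you don't need snps = [[]]; snps = [] is enough. Starting from [[]] leaves an empty list as the first element of snps.

How do I iterate over a list of lists with index positions to analyze the data?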

for i, item in enumerate(snps):
    print(i, item)

Output:    
0 ['T', 'A', 'A', 'T']
1 ['T', 'A', 'A', 'T']
2 ['T', 'A', 'A', 'T']
...
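
For the column-wise part of the question, one common approach is to transpose the rows with zip(*rows). The following is a minimal sketch, assuming the rows hold only the genotype calls, as in the output above. The sample data has one call changed to "G", and the encoding rule (the most frequent call in a column becomes 0, anything else 1) is made up for illustration, since the question does not state the actual conditions.

```python
from collections import Counter

# Sample rows based on the question; one call is changed to "G" (hypothetical)
# so that the encoding below has something to distinguish.
snps = [
    ["T", "A", "A", "T"],
    ["T", "A", "A", "T"],
    ["T", "G", "A", "T"],
    ["T", "A", "A", "T"],
]

# zip(*snps) transposes the list of lists: each tuple is one column.
columns = list(zip(*snps))

encoded_columns = []
for col_index, column in enumerate(columns):
    # Hypothetical rule: the most frequent call in the column maps to 0,
    # every other call maps to 1.
    major_call = Counter(column).most_common(1)[0][0]
    encoded_columns.append([0 if call == major_call else 1 for call in column])

# Transpose back to get one encoded row per cultivar.
encoded_rows = [list(row) for row in zip(*encoded_columns)]

# Individual cells are still reachable by index: snps[row][col].
print(columns[1])
print(encoded_rows[2])
```

Transposing once keeps the per-column logic in a single loop, and the second zip restores the row layout if the rest of the analysis expects one row per cultivar.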

If the first item in each row (the Cultivar Name column) is unique, you could consider using a dictionary ({}) instead:

snps = {}

...

for line in lines:
    row_vals = line.strip().split('\t')
    snps[row_vals[0]] = row_vals[1:]

Items in the dictionary can now be accessed like this:

print(snps['PI065549'])

Output:
['T', 'A', 'A', 'T']
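
With the dictionary layout, a column can still be pulled out by position. This is a small sketch using the sample values from the question; column_index and column are illustrative names, not part of the original code.

```python
# Dictionary keyed by cultivar name, filled with the sample values
# from the question.
snps = {
    "PI065549": ["T", "A", "A", "T"],
    "PI081762": ["T", "A", "A", "T"],
    "PI101404A": ["T", "A", "A", "T"],
    "PI101404B": ["T", "A", "A", "T"],
}

column_index = 1  # second SNP column (ss715592335 in the sample file)

# Collect the call at that position for every cultivar.
column = {name: calls[column_index] for name, calls in snps.items()}

for name, call in column.items():
    print(name, call)
```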

If you plan to do any kind of analysis, consider using pandas. There is a quick intro that also covers importing text files.

Answer 2

You're almost there (this is Python 3.4, I hope that's OK):

from pathlib import Path

DATA_PATH = Path(__file__).parent / '../data/chromosome.txt'

snpNames = [] #to hold snp names(column titles) 
chrm = [] #to hold chr, needed later for random sampling 
numLines = 0; # to determine how many lines in the files, needed to
              # determine size of 2 d array 

snps_dict = {}  # store the snp data in a dict keyed by cultivar name
snps = []  # store the snp data in a list of lists

with DATA_PATH.open('r') as file:
     #read in the first line these contain the names and store into an array 
    firstLine = file.readline().strip()
    for i in firstLine.split(): 
        snpNames.append(i)

    #second line contains the chr data, read that into the chr array
    secondLine = file.readline().strip()
    for i in secondLine.split():
        chrm.append(i)

     ## read in the remaining lines and fill in the 2 d array    
    for line in file.readlines():
        if line.startswith('Position'):
            continue
        elif line.startswith('Cultivar Name'):
            continue
        splt = line.split()     
        snps_dict[splt[0]] = splt[1:]
        snps.append(splt)

print(snpNames)
print(chrm)
print(snps_dict)
print(snps)

Which produces:

['SNP', 'Name', 'ss715583617', 'ss715592335', 'ss715591044', 
 'ss7155(98181']
['Chromosome', 'Gm02', 'Gm05', 'Gm05', 'Gm07']
{'PI065549': ['T', 'A', 'A', 'T'], 'PI101404A': ['T', 'A', 'A', 'T'], 
 'PI081762': ['T', 'A', 'A', 'T'], 'PI101404B': ['T', 'A', 'A', 'T']}
[['PI065549', 'T', 'A', 'A', 'T'], ['PI081762', 'T', 'A', 'A', 'T'], 
 ['PI101404A', 'T', 'A', 'A', 'T'], ['PI101404B', 'T', 'A', 'A', 'T']]
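
Note that split() with no argument splits on any whitespace, which is why 'SNP Name' shows up as two items in the output above. Splitting on the tab character keeps it as one field. Here is a sketch using the standard csv module, assuming the file really is tab-delimited. The sample text is embedded with io.StringIO so the example runs on its own, and the variable names are illustrative.

```python
import csv
import io

# Sample content from the question, embedded so the sketch runs without a file.
# With a real file, open(path, newline="") would take the place of StringIO.
sample = (
    "SNP Name\tss715583617\tss715592335\tss715591044\tss7155(98181\n"
    "Chromosome\tGm02\tGm05\tGm05\tGm07\n"
    "Position\t5581696\t6943050\t34695858\t43520803\n"
    "Cultivar Name\t\t\t\t\n"
    "PI065549\tT\tA\tA\tT\n"
    "PI081762\tT\tA\tA\tT\n"
)

reader = csv.reader(io.StringIO(sample), delimiter="\t")

snp_names = next(reader)[1:]   # drop the "SNP Name" label
chrm = next(reader)[1:]        # drop the "Chromosome" label
next(reader)                   # skip the Position line
next(reader)                   # skip the Cultivar Name line

cultivars = []
snps = []
for row in reader:
    if not row:                # ignore blank lines
        continue
    cultivars.append(row[0])
    snps.append(row[1:])

print(snp_names)
print(chrm)
print(cultivars)
print(snps)
```

Keeping the cultivar names in a separate list leaves snps as a pure grid of calls, so snps[row][col] lines up with snp_names[col] and chrm[col].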

Answer 3

If you have a tab-delimited file, you can use pandas to create a data frame:

import pandas as pd

df = pd.read_csv("/home/dfreese/Desktop/testSNPtext", delimiter="\t", header=None, names=["a", "b", "c", "d", "e"])

Which should give you a data frame like this:

                   a            b            c            d             e
0       SNP Name  ss715583617  ss715592335  ss715591044  ss7155(98181
1     Chromosome         Gm02         Gm05         Gm05          Gm07
2       Position      5581696      6943050     34695858      43520803
3  Cultivar Name          NaN          NaN          NaN           NaN
4       PI065549            A            A            T           NaN
5       PI081762            A            A            T           NaN
6      PI101404A            A            A            T           NaN
7      PI101404B            A            A            T           NaN

To get just the last four rows:

print(df.iloc[4:8])
           a  b  c  d    e
4   PI065549  A  A  T  NaN
5   PI081762  A  A  T  NaN
6  PI101404A  A  A  T  NaN
7  PI101404B  A  A  T  NaN

Python: for-looping over a list of lists, or changing the data into a 2-D array