I am having some trouble with Python. I have a tab-delimited text file that looks like this:
SNP Name ss715583617 ss715592335 ss715591044 ss7155(98181
Chromosome Gm02 Gm05 Gm05 Gm07
Position 5581696 6943050 34695858 43520803
Cultivar Name
PI065549 T A A T
PI081762 T A A T
PI101404A T A A T
PI101404B T A A T
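I need to read this in line by line: put the SNP names into one array and the chromosomes into another, skip the Position line, and skip the Cultivar Name line. The data starting at PI065549 T A A T should then go into a 2D array. My understanding is that in Python this means storing the data in a list of lists. My questions are:
From my code, am I placing the data into a list of lists?
How do I iterate through a list of lists with index positions to analyze the data?
My main goal is to analyze the data column by column, converting the characters to integers according to certain conditions.
My code so far is: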
snpNames = [] #to hold snp names(column titles)
chrm = [] #to hold chr, needed later for random sampling
numLines = 0; # to determine how many lines in the files, needed to determine size of 2 d array
snps = [[]] ## store the snp data in a list of lists...
with open("/home/dfreese/Desktop/testSNPtext") as file:
    #read in the first line these contain the names and store into an array
    firstLine = file.readline().strip()
    for i in firstLine.split("\t"):
        snpNames.append(i)
    #second line contains the chr data, read that into the chr array
    secondLine = file.readline().strip()
    for i in secondLine.split("\t"):
        chrm.append(i)
    ## read in the remaining lines and fill in the 2 d array
    for line in lines:
        snps.append(line.strip().split("\t"))
file.close()
#check that the data is ok
for i in snps:
    print (i)
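Any help would be greatly appreciated. I am used to coding in C++ and Java, but Python was required for this data analysis and I am a bit stuck. Any suggestions or improvements are welcome.
Answer 0 (score: 0)
From my code, am I placing the data into a list of lists?
Yes, you are. However, you don't need snps = [[]], which starts the list with an empty row; snps = [] is enough.
How do I iterate through a list of lists with index positions to analyze the data?
for i, item in enumerate(snps):
    print(i, item)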
Output:
0 ['T', 'A', 'A', 'T']
1 ['T', 'A', 'A', 'T']
2 ['T', 'A', 'A', 'T']
...
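For the column-wise analysis mentioned in the question, here is a minimal sketch of two ways to reach individual cells and whole columns. It assumes snps holds only the genotype calls, as in the output above; the rows are hard-coded from the question's sample so the snippet runs on its own, and the names columns, row_idx and col_idx are just illustrative.

```python
# Sample rows copied from the question's data (genotype calls only).
snps = [
    ["T", "A", "A", "T"],
    ["T", "A", "A", "T"],
    ["T", "A", "A", "T"],
    ["T", "A", "A", "T"],
]

# Nested enumerate gives both the row index and the column index.
for row_idx, row in enumerate(snps):
    for col_idx, call in enumerate(row):
        print(row_idx, col_idx, call)

# zip(*snps) transposes the rows, yielding one tuple per column.
columns = [list(col) for col in zip(*snps)]
print(columns[0])  # every call for the first SNP
```

Transposing once with zip is usually simpler than indexing snps[row][col] in a loop when each SNP is processed as a unit.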
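If the first item in each row (the Cultivar Name column) is unique, you could consider using a dictionary ({}) instead:
snps = {}
...
for line in lines:
    row_vals = line.strip().split('\t')
    snps[row_vals[0]] = row_vals[1:]
Items in the dictionary can now be accessed like this: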
print(snps['PI065549'])
Output:
['T', 'A', 'A', 'T']
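If you also keep the SNP names from the first line, the dictionary can be queried by cultivar and by SNP name. This is only a sketch: call_for is a hypothetical helper, and the data is hard-coded from the question's sample so the snippet is self-contained.

```python
# SNP names as they appear in the question's header line (first field dropped).
snp_names = ["ss715583617", "ss715592335", "ss715591044", "ss7155(98181"]

# Genotype calls keyed by cultivar name, copied from the question's sample.
snps = {
    "PI065549": ["T", "A", "A", "T"],
    "PI081762": ["T", "A", "A", "T"],
    "PI101404A": ["T", "A", "A", "T"],
    "PI101404B": ["T", "A", "A", "T"],
}

def call_for(cultivar, snp_name):
    """Hypothetical helper: the call for one cultivar at one named SNP."""
    return snps[cultivar][snp_names.index(snp_name)]

# One whole column: every cultivar's call at a single SNP.
col = snp_names.index("ss715592335")
column = {cultivar: calls[col] for cultivar, calls in snps.items()}

print(call_for("PI065549", "ss715592335"))
print(column)
```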
If you plan to do any kind of analysis, consider using pandas. Its quick intro also covers importing text files.
Answer 1 (score: 0)
You are almost there (this is Python 3.4, I hope that's OK):
from pathlib import Path
DATA_PATH = Path(__file__).parent / '../data/chromosome.txt'
snpNames = [] # to hold snp names (column titles)
chrm = [] # to hold chr, needed later for random sampling
numLines = 0 # to determine how many lines in the files, needed to
             # determine size of 2 d array
snps_dict = {} # store the snp data keyed by cultivar name
snps = [] # store the snp data in a list of lists
with DATA_PATH.open('r') as file:
    # read in the first line; it contains the names, store them in a list
    firstLine = file.readline().strip()
    for i in firstLine.split():
        snpNames.append(i)
    # second line contains the chr data, read that into the chr list
    secondLine = file.readline().strip()
    for i in secondLine.split():
        chrm.append(i)
    # read in the remaining lines and fill in the 2 d array
    for line in file.readlines():
        # skip the Position and Cultivar Name lines
        if line.startswith('Position'):
            continue
        elif line.startswith('Cultivar Name'):
            continue
        splt = line.split()
        snps_dict[splt[0]] = splt[1:]
        snps.append(splt)
print(snpNames)
print(chrm)
print(snps_dict)
print(snps)
This produces:
['SNP', 'Name', 'ss715583617', 'ss715592335', 'ss715591044',
'ss7155(98181']
['Chromosome', 'Gm02', 'Gm05', 'Gm05', 'Gm07']
{'PI065549': ['T', 'A', 'A', 'T'], 'PI101404A': ['T', 'A', 'A', 'T'],
'PI081762': ['T', 'A', 'A', 'T'], 'PI101404B': ['T', 'A', 'A', 'T']}
[['PI065549', 'T', 'A', 'A', 'T'], ['PI081762', 'T', 'A', 'A', 'T'],
['PI101404A', 'T', 'A', 'A', 'T'], ['PI101404B', 'T', 'A', 'A', 'T']]
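The question's goal of converting characters to integers column by column could then be sketched as follows. The question does not say which conditions apply, so the rule here is purely an assumption: 0 if a call matches the most common call in its column, 1 otherwise. The last row is made up (it is not in the question's file) so that the output is not all zeros.

```python
from collections import Counter

snps = [
    ["PI065549", "T", "A", "A", "T"],
    ["PI081762", "T", "A", "A", "T"],
    ["PI101404A", "T", "A", "A", "T"],
    ["PI101404B", "T", "A", "A", "T"],
    ["HYPOTHETICAL", "C", "A", "G", "T"],  # invented row, for illustration only
]

names = [row[0] for row in snps]   # cultivar names
calls = [row[1:] for row in snps]  # genotype calls only

encoded_columns = []
for column in zip(*calls):  # one tuple per SNP
    # Assumed rule: the most common call in the column becomes 0, others 1.
    major = Counter(column).most_common(1)[0][0]
    encoded_columns.append([0 if call == major else 1 for call in column])

# Transpose back so each row lines up with its cultivar again.
encoded_rows = [list(row) for row in zip(*encoded_columns)]
for name, row in zip(names, encoded_rows):
    print(name, row)
```

Swapping in a different condition only means changing the expression inside the list comprehension.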
Answer 2 (score: 0)
If you have a tab-delimited file, you can use pandas to create a data frame:
import pandas as pd
df = pd.read_csv("/home/dfreese/Desktop/testSNPtext", delimiter="\t", header=None, names=["a","b","c","d","e"])
This should give you a data frame similar to:
               a            b            c            d             e
0       SNP Name  ss715583617  ss715592335  ss715591044  ss7155(98181
1     Chromosome         Gm02         Gm05         Gm05          Gm07
2       Position      5581696      6943050     34695858      43520803
3  Cultivar Name          NaN          NaN          NaN           NaN
4       PI065549            A            A            T           NaN
5       PI081762            A            A            T           NaN
6      PI101404A            A            A            T           NaN
7      PI101404B            A            A            T           NaN
To get just the last four rows:
print(df.iloc[4:8])
           a  b  c  d    e
4   PI065549  A  A  T  NaN
5   PI081762  A  A  T  NaN
6  PI101404A  A  A  T  NaN
7  PI101404B  A  A  T  NaN
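As a variation, read_csv can use the SNP names as column labels and the cultivar names as the index, skipping the Chromosome, Position and Cultivar Name lines. The sketch below reads the question's sample from an in-memory string instead of the real path so that it runs on its own. The integer codes are hypothetical, since the question does not state its conversion rule.

```python
import io

import pandas as pd

# The question's sample file, tab-delimited, held in memory for the example.
text = (
    "SNP Name\tss715583617\tss715592335\tss715591044\tss7155(98181\n"
    "Chromosome\tGm02\tGm05\tGm05\tGm07\n"
    "Position\t5581696\t6943050\t34695858\t43520803\n"
    "Cultivar Name\n"
    "PI065549\tT\tA\tA\tT\n"
    "PI081762\tT\tA\tA\tT\n"
    "PI101404A\tT\tA\tA\tT\n"
    "PI101404B\tT\tA\tA\tT\n"
)

# Line 0 becomes the header; lines 1-3 (Chromosome, Position, Cultivar Name)
# are skipped; the first field of each remaining line becomes the index.
df = pd.read_csv(io.StringIO(text), sep="\t", header=0,
                 skiprows=[1, 2, 3], index_col=0)

# Hypothetical character-to-integer codes, applied one column at a time.
codes = {"A": 0, "C": 1, "G": 2, "T": 3}
encoded = df.apply(lambda col: col.map(codes))

print(df)
print(encoded)
```

The chromosome line can be read separately if it is needed later, for example with the readline approach from the other answers.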