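Problem:
The input is a tab-delimited file. Rows are variables and columns are samples. Each variable can take one of three values (00, 01, 11), and the variables are listed in an order that must be preserved (v1 -> vN). There are a large number of rows and columns, so the input file needs to be read in chunks.
The input looks like this: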
s1 s2 s3 s4
v1 00 00 11 01
v2 00 00 00 00
v3 01 11 00 00
v4 00 00 00 00
(...)
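What I want to do is split the input into blocks of consecutive rows, each block just large enough that every sample is unique within it. In the example above, starting from v1, the first block should end at v3, because at that point there is enough information for every sample to be unique. The next block starts at v4 and the process repeats. The task ends when the last row is reached. The blocks should be written to an output file.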
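My attempt:
I was thinking of using the csv module to generate an array of lists, each list containing the state of a single variable across all samples (00, 01, 00). Alternatively, by pivoting the input, I could create lists containing the states of each sample across all variables. What I am asking is whether the work should focus on columns or on rows, i.e. whether it is better to work with v1 = ['00', '00', '11', '01'] or with s1 = ['00', '00', '01', '00', ...].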
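To make the two orientations concrete, here is a minimal sketch using only the first three variables of the example input. The name `rows` and the hard-coded values are just for illustration; the header row and the variable names are assumed to have been stripped already.

```python
# Toy data copied from the example input (v1 to v3, names removed).
rows = [
    ['00', '00', '11', '01'],  # v1
    ['00', '00', '00', '00'],  # v2
    ['01', '11', '00', '00'],  # v3
]

# Row orientation: one list per variable.
v1 = rows[0]

# Column orientation: zip(*rows) transposes, giving one tuple per sample.
columns = list(zip(*rows))
s1 = columns[0]

# Samples are unique when no two column tuples are equal.
all_unique = len(set(columns)) == len(columns)
print(v1, s1, all_unique)
```

Either orientation carries the same information; the column tuples are simply convenient for the uniqueness test because tuples can be put in a set.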
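The following code is my attempt at the pivot operation, turning the column problem into a row problem. (Sorry for the clumsy Python syntax, it is the best I can do.)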
import csv

my_infilename = 'my_file.txt'
csv_infile = csv.reader(open(my_infilename, 'r'), delimiter='\t')
out = open('transposed_' + my_infilename, 'w')
csv_infile = zip(*csv_infile)  # transpose: rows become columns
line_n = 0
for line in csv_infile:
    line_n += 1
    if line_n == 1:  # headers
        continue
    else:
        line = ','.join(line) + '\n'  # just to make it readable to me
        out.write(line)
out.close()
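What is the best way to solve this problem? Would pivoting the data help at all? Are there any built-in functions I can rely on?
Answer 0 (score: 2)
Assuming you have imported the csv data as a list of lists of equal length, would this help?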
def get_block(data_rows):
    # one accumulated string per sample (column)
    samples = []
    for cell in data_rows[0]:
        samples.append('')
    # add one row at a time to each sample and see if all are unique
    for row_index, row in enumerate(data_rows):
        for cell_index, cell in enumerate(row):
            samples[cell_index] = '%s%s' % (samples[cell_index], cell)
        are_all_unique = True
        sample_dict = {}  # use dictionary keys to find repeats
        for sample in samples:
            if sample_dict.get(sample):
                # already there, so another row needed
                are_all_unique = False
                break
            sample_dict[sample] = True  # add the key to the dictionary
        if are_all_unique:
            return True, row_index
    return False, None


def get_all_blocks(all_rows):
    remaining_rows = all_rows[:]  # make a copy
    blocks = []
    while True:
        found_block, block_end_index = get_block(remaining_rows)
        if found_block:
            blocks.append(remaining_rows[:block_end_index+1])
            remaining_rows = remaining_rows[block_end_index+1:]
            if not remaining_rows:
                break
        else:
            # the leftover rows never become unique; keep them as a final block
            blocks.append(remaining_rows[:])
            break
    return blocks


if __name__ == "__main__":
    v1 = ['00', '00', '11', '01']
    v2 = ['00', '00', '00', '00']
    v3 = ['01', '11', '00', '00']
    v4 = ['00', '00', '00', '00']
    all_rows = [v1, v2, v3, v4]
    blocks = get_all_blocks(all_rows)
    for index, block in enumerate(blocks):
        print("This is block %s." % index)
        for row in block:
            print(row)
        print()
=================
This is block 0.
['00', '00', '11', '01']
['00', '00', '00', '00']
['01', '11', '00', '00']

This is block 1.
['00', '00', '00', '00']
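Since the question says the file is too large to load at once, the same idea can also be sketched as a generator that consumes rows lazily, for example straight from a `csv.reader`. This is only a sketch under assumptions: the function name `iter_blocks` and the parameter `first_data_col` are hypothetical, the first cell of every row is assumed to be the variable name, and the header line (including the placeholder `name` column) is invented for the demo.

```python
import csv
import io


def iter_blocks(rows, first_data_col=1):
    """Yield lists of rows; a block ends at the first row where all samples differ.

    `rows` can be any iterable of equal-length lists (e.g. a csv.reader),
    so only the current block is held in memory.
    """
    block = []
    signatures = None  # one growing tuple per sample (column)
    for row in rows:
        cells = row[first_data_col:]
        if signatures is None:
            signatures = [()] * len(cells)
        signatures = [sig + (cell,) for sig, cell in zip(signatures, cells)]
        block.append(row)
        if len(set(signatures)) == len(signatures):
            yield block
            block, signatures = [], None
    if block:  # trailing rows that never became unique
        yield block


# Demo on an in-memory stand-in for the tab-delimited file.
demo = io.StringIO(
    "name\ts1\ts2\ts3\ts4\n"
    "v1\t00\t00\t11\t01\n"
    "v2\t00\t00\t00\t00\n"
    "v3\t01\t11\t00\t00\n"
    "v4\t00\t00\t00\t00\n"
)
reader = csv.reader(demo, delimiter="\t")
header = next(reader)  # skip the header line
blocks = list(iter_blocks(reader))
for index, block in enumerate(blocks):
    print("block", index, [row[0] for row in block])
```

With a real file, the `io.StringIO` object would be replaced by an open file handle, and each yielded block could be written out before the next one is read.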
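Answer 1 (score: 0)
I do not understand your question at all ("ordinate variables"? "univocally determine the samples"?), but I can tell that you are using the csv module incorrectly, and your indentation is wrong as well.
I do not know exactly what your input file looks like, but assuming it is tab-delimited, the following (untested) script shows one way to take blocks from the input file, transform them, and write them back out to an output file.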
import csv


# this is not strictly necessary, but you can define a custom dialect for input and output
class SampleDialect(csv.Dialect):
    delimiter = "\t"
    quoting = csv.QUOTE_NONE
    lineterminator = "\n"  # must be set, otherwise dialect validation fails


sampledialect = SampleDialect()

ifn = 'my_file.txt'
ofn = 'transposed_' + ifn
# Python 3: open csv files in text mode with newline='' (Python 2 used 'rb'/'wb')
ifp = open(ifn, 'r', newline='')
ofp = open(ofn, 'w', newline='')
incsv = csv.reader(ifp, dialect=sampledialect)
outcsv = csv.writer(ofp, dialect=sampledialect)

header = None
block = []
for lineno, samples in enumerate(incsv):
    if lineno == 0:  # header
        header = samples
        continue
    block.append(samples)
    if lineno % 3 == 0:
        # end of block (here simply every three data rows)
        # do something with block
        # then write it out
        outcsv.writerows(block)
        block = []
if block:
    # write any rows left over after the last full block
    outcsv.writerows(block)

ifp.close()
ofp.close()
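The fixed-size grouping done with the line counter above can also be sketched with `itertools.islice`, which keeps the chunking logic separate from the reading and writing. The helper name `chunked` and the toy rows are hypothetical; the size of 3 just mirrors the script.

```python
from itertools import islice


def chunked(iterable, size):
    """Yield successive lists of up to `size` items from any iterable."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


# Toy rows standing in for what csv.reader would produce after the header.
rows = [
    ['v1', '00', '00', '11', '01'],
    ['v2', '00', '00', '00', '00'],
    ['v3', '01', '11', '00', '00'],
    ['v4', '00', '00', '00', '00'],
    ['v5', '11', '01', '00', '00'],
]
chunks = list(chunked(rows, 3))
print([len(chunk) for chunk in chunks])
```

Because `chunked` accepts any iterable, a `csv.reader` can be passed in directly, and the final short chunk is yielded without a separate leftover check.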