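I have an .xls file with 4 rows but many columns. I saved it as a tab-delimited .txt file, as shown below.
The first column is the important one: each string in it is a comma-separated list of elements.
Sample data can be found here: https://gist.github.com/anonymous/92a95026f9869790f209dc9ce8f55a59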
A,B A13 This is India
AFD,DNGS,SGDH 3TR This is how it is
NHYG,QHD,lkd,uyete TRD Where to go
AFD,TTT YTR What to do
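I want every output row to hold exactly one pair from the first column. When a row yields several pair combinations, the remaining columns should be repeated for each pair.
This is what I am looking for: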
A B A13 This is India
AFD DNGS 3TR This is how it is
AFD SGDH 3TR This is how it is
DNGS SGDH 3TR This is how it is
NHYG QHD TRD Where to go
NHYG lkd TRD Where to go
NHYG uyete TRD Where to go
QHD lkd TRD Where to go
QHD uyete TRD Where to go
lkd uyete TRD Where to go
AFD TTT YTR What to do
Let's call my first data set Data.
What I tried was reading it line by line:
import itertools

lines = open("data.txt").readlines()
for line in lines:
    myrows = line.split(",")
    out_list = []
    for i in range(1, len(myrows)+1):
        out_list.extend(itertools.combinations(lines, i))
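Answer 0 (score: 2)
I think you have the right idea in using itertools.combinations(), but you need to run it only over the elements of the first column, not over the whole line.
Here is my solution: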
import io
import itertools

# Sample rows; each \t is the tab that separates two columns.
data = """"A,B "\tA13\tThis is India
"AFD,DNGS,SGDH "\t3TR\tThis is how it is
"NHYG,QHD,lkd,uyete"\tTRD\tWhere to go
"AFD,TTT"\tYTR\tWhat to do"""

for line in io.StringIO(data):
    # extract the first column (e1) and the rest of the line (e2)
    e1, e2 = line.rstrip('\n').split('\t', 1)
    # remove the extra quotes and whitespace, then split into elements
    es = [e.strip() for e in e1.replace('"', '').split(',')]
    # iterate over all combinations of 2 elements
    for pair in itertools.combinations(es, 2):
        print('{}\t{}'.format('\t'.join(pair), e2))
Result:
A B A13 This is India
AFD DNGS 3TR This is how it is
AFD SGDH 3TR This is how it is
DNGS SGDH 3TR This is how it is
NHYG QHD TRD Where to go
NHYG lkd TRD Where to go
NHYG uyete TRD Where to go
QHD lkd TRD Where to go
QHD uyete TRD Where to go
lkd uyete TRD Where to go
AFD TTT YTR What to do
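As an alternative sketch, the quoted first column can also be parsed with the standard csv module instead of stripping the quotes by hand. This assumes the file really is tab-delimited with double quotes around the first field; the helper name expand_pairs and the shortened sample are hypothetical, not part of the original answer.

```python
import csv
import io
import itertools


def expand_pairs(rows):
    """Yield one output row per 2-combination of the comma-separated first field."""
    for row in rows:
        if not row:
            continue  # skip blank lines
        keys = [k.strip() for k in row[0].split(',') if k.strip()]
        for pair in itertools.combinations(keys, 2):
            yield list(pair) + row[1:]


# Hypothetical sample mirroring the question's data; fields are tab-separated.
sample = (
    '"A,B "\tA13\tThis is India\n'
    '"AFD,DNGS,SGDH "\t3TR\tThis is how it is\n'
)

# csv.reader removes the surrounding double quotes from the first field.
reader = csv.reader(io.StringIO(sample), delimiter='\t')
expanded = list(expand_pairs(reader))
for out_row in expanded:
    print('\t'.join(out_row))
```

To run this against a real file, the io.StringIO(sample) object could be swapped for an open file handle.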
Edit
Here is the modified version.
Note that enumerate() applied to f.readlines() yields the index of the current line together with the line itself.
import itertools

with open('data.txt') as f:
    header = f.readline()  # copy the header line through unchanged
    with open('result.txt', 'w') as w:
        w.write(header)
        for n, line in enumerate(f.readlines()):
            # drop the trailing newline so the last column does not carry it
            elems = line.rstrip('\n').split('\t')
            e0 = [e.replace('"', '').strip() for e in elems[0].split(',')]
            for pairs in itertools.combinations(e0, 2):
                w.write('{:d}\t{}\t{}\n'.format(n + 1, '\t'.join(pairs), '\t'.join(elems[1:])))
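The row numbering can also be sketched with enumerate's start argument, which removes the need for n+1. The in-memory rows below are hypothetical stand-ins for the lines of data.txt that follow the header.

```python
import itertools

# Hypothetical rows standing in for the lines of data.txt after the header.
rows = ['"A,B"\tA13\n', '"AFD,DNGS,SGDH"\t3TR\n']

numbered = []
for n, line in enumerate(rows, start=1):  # n starts at 1, so no n + 1 is needed
    first, rest = line.rstrip('\n').split('\t', 1)
    keys = [k.strip() for k in first.replace('"', '').split(',')]
    for pair in itertools.combinations(keys, 2):
        numbered.append('{:d}\t{}\t{}'.format(n, '\t'.join(pair), rest))

for out_line in numbered:
    print(out_line)
```

Every pair derived from the same input row shares that row's number, which makes it easy to trace an output line back to its source.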
Answer 1 (score: 0)
you_data.txt:
"Q92828, O60907, O75376" 15 NCOR complex Human MI:0004- affinity chromatography technologies | MI:0019- coimmunoprecipitation | MI:0069- mass spectrometry studies of complexes 12628926 "By using specific small interference RNAs (siRNAs), the authors demonstrate that HDAC3 is essential, whereas TBL1 and TBLR1 are functionally redundant but essential for repression by unliganded thyroid hormone receptor."
"O15143, O15144, O15145, P61158, P61160, P59998, O15511" 27 Arp2/3 protein complex Human MI:0027- cosedimentation | MI:0071- molecular sieving 9359840
"Q9UL46, Q06323" 30 PA28 complex 11S REG Human MI:0071- molecular sieving | MI:0226- ion exchange chromatography 9325261 "PA28 is a regulatory complex of the 20S proteasome. It acts as proteasome activator and stimulates cleavage after basic, acidic, and most hydrophobic residues in many peptides."
"P55036, P62333, O43242, P35998, P48556, P62191, Q13200, Q99460, O00232, P17980, P62195, O00487, P51665, Q15008, O75832, O00233" 32 PA700 complex 19S complex Human MI:0226- ion exchange chromatography | MI:0071- molecular sieving 9148964 "The proteasome is an essential component of the ATP-dependent proteolytic pathway in eukaryotic cells and is responsible for the degradation of most cellular proteins (for reviews see PMID:8811196 and PMID:10872471). It contains a barrel-shaped proteolytic core complex (the 20S proteasome), and is capped at one or both ends by regulatory complexes like the 19S complex (PMID:11812135), modulator (PMID:8621709), PA28 (PMID:9325261) and PA28gamma (PMID:9325261). Interferon-gamma (IFN-gamma) alters the peptide-degrading specificity of proteasomes and produces an immunoproteasome responsible for accelerated processing of nonself endogenous antigens by inducing the replacement of subunits Psmb5, Psmb6 and Psmb7 by Psmb8, Psmb9 and Psmb10, respectively."
Code:
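import itertools

with open('you_data.txt') as f:
    index = 1
    for line in f:
        split_line = line.split('"')
        # NOTE: maxsplit=2 yields at most three pieces, so any further IDs
        # stay joined in the last piece; use split(',') to pair every ID.
        key = split_line[1].strip().split(',', 2)
        # text between the quoted ID list and the next quote, tabs replaced by spaces
        value = split_line[2].strip().replace('\t', ' ')
        for pair in itertools.combinations(key, 2):
            pair = [i.strip() for i in pair]
            print('{:<4}{:8}{:8}{:20}'.format(index, *pair, value))
        index += 1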
Output:
1 Q92828 O60907 15 NCOR complex Human MI:0004- affinity chromatography technologies | MI:0019- coimmunoprecipitation | MI:0069- mass spectrometry studies of complexes 12628926
1 Q92828 O75376 15 NCOR complex Human MI:0004- affinity chromatography technologies | MI:0019- coimmunoprecipitation | MI:0069- mass spectrometry studies of complexes 12628926
1 O60907 O75376 15 NCOR complex Human MI:0004- affinity chromatography technologies | MI:0019- coimmunoprecipitation | MI:0069- mass spectrometry studies of complexes 12628926
2 O15143 O15144 27 Arp2/3 protein complex Human MI:0027- cosedimentation | MI:0071- molecular sieving 9359840
2 O15143 O15145, P61158, P61160, P59998, O1551127 Arp2/3 protein complex Human MI:0027- cosedimentation | MI:0071- molecular sieving 9359840
2 O15144 O15145, P61158, P61160, P59998, O1551127 Arp2/3 protein complex Human MI:0027- cosedimentation | MI:0071- molecular sieving 9359840
3 Q9UL46 Q06323 30 PA28 complex 11S REG Human MI:0071- molecular sieving | MI:0226- ion exchange chromatography 9325261
4 P55036 P62333 32 PA700 complex 19S complex Human MI:0226- ion exchange chromatography | MI:0071- molecular sieving 9148964
4 P55036 O43242, P35998, P48556, P62191, Q13200, Q99460, O00232, P17980, P62195, O00487, P51665, Q15008, O75832, O0023332 PA700 complex 19S complex Human MI:0226- ion exchange chromatography | MI:0071- molecular sieving 9148964
4 P62333 O43242, P35998, P48556, P62191, Q13200, Q99460, O00232, P17980, P62195, O00487, P51665, Q15008, O75832, O0023332 PA700 complex 19S complex Human MI:0226- ion exchange chromatography | MI:0071- molecular sieving 9148964
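In the output above, some rows still carry several IDs in one column, because split(',', 2) stops after two splits. A minimal sketch that pairs every ID could look like the following; the function name pairs_from_line and the shortened sample line are hypothetical.

```python
import itertools


def pairs_from_line(line):
    """Return (id_a, id_b, rest) tuples for every pair of IDs in the quoted first field."""
    parts = line.split('"')
    ids = [p.strip() for p in parts[1].split(',')]  # no maxsplit: split every ID
    rest = parts[2].strip().replace('\t', ' ')
    return [(a, b, rest) for a, b in itertools.combinations(ids, 2)]


# Shortened, hypothetical line in the same shape as you_data.txt.
line = '"O15143, O15144, O15145, P61158"\t27\tArp2/3 protein complex\n'
result = pairs_from_line(line)
for a, b, rest in result:
    print('{:8}{:8}{}'.format(a, b, rest))
```

With four IDs this produces six rows, one per unordered pair. Like the answer's code, it keeps only the text between the quoted ID list and the next double quote.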
Answer 2 (score: 0)
This reminds me of flatmap.
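import itertools

def tolist(s):
    # split the comma-separated key column into its elements
    return s.split(',')

def tostring(l):
    # build every "x,y" pair with i < j
    return [','.join([l[i], l[j]]) for i in range(len(l)) for j in range(len(l)) if i < j]

def tomult(ll):
    # one output row per pair, each followed by the remaining columns
    return map(lambda x: [x] + ll[1:], tostring(tolist(ll[0])))

def tofmap(lines):
    # flatten the per-row results into a single list
    return list(itertools.chain(*map(lambda line: tomult(line), lines)))

a = [['c1,c2', '200', 'line one'], ['a,b,c', '100', 'this is good']]
b = tofmap(a)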
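The same flatmap idea can be sketched more compactly with itertools.chain.from_iterable and itertools.combinations. The names flat_map and expand are hypothetical helpers introduced for illustration, not part of the answer above.

```python
import itertools


def flat_map(func, iterable):
    """Apply func to each item and concatenate the resulting iterables."""
    return itertools.chain.from_iterable(map(func, iterable))


def expand(row):
    """Turn one row into one row per pair of comma-separated keys in row[0]."""
    keys = row[0].split(',')
    return [[','.join(pair)] + row[1:] for pair in itertools.combinations(keys, 2)]


a = [['c1,c2', '200', 'line one'], ['a,b,c', '100', 'this is good']]
b = list(flat_map(expand, a))
print(b)
```

Here the first row has a single pair and passes through unchanged, while the second row expands into three rows for a,b then a,c then b,c.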