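I am trying to convert the XML file downloaded from DrugBank. Whenever I try to import it into Excel 2007, it says the file cannot be imported, perhaps because of its size. Can anyone suggest another way to open this file so that I can save it as tab-delimited? It is the first file (all drugs, including target, transporter, carrier, and enzyme information) in XML format at http://www.drugbank.ca/downloads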
Answer 0 (score: 2)
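This is a complete rewrite of my original answer.
For my original answer, I performed only a limited analysis of drugbank.xml. I hedged a little, but said the structure was too complicated to convert to any standard tab-delimited file. By that I meant a tab-delimited file that could be processed by any standard program. I stand by that statement, but it is possible to create a non-standard delimited file that might be useful.
The table below shows the structure of drugbank.xml.
The columns are index, level, name, parent (the index of the parent element), and repeats. For the elements drug and partner, repeats is the actual number of occurrences. For every other element, it is the maximum number of occurrences within a single instance of its parent element.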
Inx Lvl Name------------------------------------ Pnt Repeats
1 1 drugs 0 1
2 2 drug 1 6711
3 3 drugbank-id 2 1
4 3 name 2 1
5 3 description 2 1
6 3 cas-number 2 1
7 3 general-references 2 1
8 3 synthesis-reference 2 1
9 3 indication 2 1
10 3 pharmacology 2 1
11 3 mechanism-of-action 2 1
12 3 toxicity 2 1
13 3 biotransformation 2 1
14 3 absorption 2 1
15 3 half-life 2 1
16 3 protein-binding 2 1
17 3 route-of-elimination 2 1
18 3 volume-of-distribution 2 1
19 3 clearance 2 1
20 3 secondary-accession-numbers 2 1
21 4 secondary-accession-number 20 5
22 3 groups 2 1
23 4 group 22 3
24 3 taxonomy 2 1
25 4 kingdom 24 1
26 4 substructures 24 1
27 5 substructure 26 35
28 3 synonyms 2 1
29 4 synonym 28 82
30 3 salts 2 1
31 4 salt 30 17
32 3 brands 2 1
33 4 brand 32 230
34 3 mixtures 2 1
35 4 mixture 34 340
36 5 name 35 1
37 5 ingredients 35 1
38 3 packagers 2 1
39 4 packager 38 173
40 5 name 39 1
41 5 url 39 1
42 3 manufacturers 2 1
43 4 manufacturer 42 91
44 3 prices 2 1
45 4 price 44 172
46 5 description 45 1
47 5 cost 45 1
48 5 unit 45 1
49 3 categories 2 1
50 4 category 49 11
51 3 affected-organisms 2 1
52 4 affected-organism 51 3
53 3 dosages 2 1
54 4 dosage 53 22
55 5 form 54 1
56 5 route 54 1
57 5 strength 54 1
58 3 atc-codes 2 1
59 4 atc-code 58 36
60 3 ahfs-codes 2 1
61 4 ahfs-code 60 11
62 3 patents 2 1
63 4 patent 62 5
64 5 number 63 1
65 5 country 63 1
66 5 approved 63 1
67 5 expires 63 1
68 3 food-interactions 2 1
69 4 food-interaction 68 6
70 3 drug-interactions 2 1
71 4 drug-interaction 70 246
72 5 drug 71 1
73 5 name 71 1
74 5 description 71 1
75 3 protein-sequences 2 1
76 4 protein-sequence 75 10
77 5 header 76 1
78 5 chain 76 1
79 3 calculated-properties 2 1
80 4 property 79 18
81 5 kind 80 1
82 5 value 80 1
83 5 source 80 1
84 3 experimental-properties 2 1
85 4 property 84 4
86 5 kind 85 1
87 5 value 85 1
88 5 source 85 1
89 3 external-identifiers 2 1
90 4 external-identifier 89 13
91 5 resource 90 1
92 5 identifier 90 1
93 3 external-links 2 1
94 4 external-link 93 4
95 5 resource 94 1
96 5 url 94 1
97 3 targets 2 1
98 4 target 97 144
99 5 actions 98 1
100 6 action 99 2
101 5 references 98 1
102 5 known-action 98 1
103 3 enzymes 2 1
104 4 enzyme 103 19
105 5 actions 104 1
106 6 action 105 3
107 5 references 104 1
108 3 transporters 2 1
109 4 transporter 108 24
110 5 actions 109 1
111 6 action 110 3
112 5 references 109 1
113 3 carriers 2 1
114 4 carrier 113 6
115 5 actions 114 1
116 6 action 115 1
117 5 references 114 1
118 2 partners 1 1
119 3 partner 118 4227
120 4 name 119 1
121 4 general-function 119 1
122 4 specific-function 119 1
123 4 gene-name 119 1
124 4 locus 119 1
125 4 reaction 119 1
126 4 signals 119 1
127 4 cellular-location 119 1
128 4 transmembrane-regions 119 1
129 4 theoretical-pi 119 1
130 4 molecular-weight 119 1
131 4 chromosome 119 1
132 4 species 119 1
133 5 category 132 1
134 5 name 132 1
135 5 uniprot-name 132 1
136 5 uniprot-taxon-id 132 1
137 4 essentiality 119 1
138 4 references 119 1
139 4 external-identifiers 119 1
140 5 external-identifier 139 9
141 6 resource 140 1
142 6 identifier 140 1
143 4 synonyms 119 1
144 5 synonym 143 38
145 4 protein-sequence 119 1
146 5 header 145 1
147 5 chain 145 1
148 4 gene-sequence 119 1
149 5 header 148 1
150 5 chain 148 1
151 4 pfams 119 1
152 5 pfam 151 15
153 6 identifier 152 1
154 6 name 152 1
155 4 go-classifiers 119 1
156 5 go-classifier 155 49
157 6 category 156 1
158 6 description 156 1
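A structure table like the one above could be produced along the following lines. This is only a sketch, not the utility described in this answer. It assumes Python's standard `xml.etree.ElementTree.iterparse`. The names `max_repeats`, `local_name`, and `SAMPLE` are made up for illustration, and the tiny in-memory document stands in for drugbank.xml.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def local_name(tag):
    """Drop a '{namespace}' prefix, if the file declares one."""
    return tag.rsplit("}", 1)[-1]


def max_repeats(source):
    """Map each element path to its maximum count under one parent instance.

    Paths are tuples of element names; the dict keeps first-seen document order.
    """
    repeats = {}          # path -> largest count seen under a single parent
    path = []             # names of the currently open elements
    counts = [Counter()]  # counts[-1] tallies children of the innermost open element
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            name = local_name(elem.tag)
            counts[-1][name] += 1
            path.append(name)
            counts.append(Counter())
            repeats.setdefault(tuple(path), 0)
        else:
            # The element is closing: its children's tallies are now final.
            for name, n in counts.pop().items():
                key = tuple(path) + (name,)
                repeats[key] = max(repeats[key], n)
            path.pop()
            elem.clear()  # discard text and children to limit memory use
    # The root element was tallied in the outermost counter.
    for name, n in counts[0].items():
        repeats[(name,)] = max(repeats[(name,)], n)
    return repeats


# Hypothetical miniature document, used only to exercise the function.
SAMPLE = (
    b"<drugs>"
    b"<drug><name>A</name><groups><group>x</group><group>y</group></groups></drug>"
    b"<drug><name>B</name><groups><group>z</group></groups></drug>"
    b"</drugs>"
)

for element_path, n in max_repeats(io.BytesIO(SAMPLE)).items():
    print(len(element_path), "/".join(element_path), n)
```

Because drugs and partners each occur once, the per-parent maximum for drug and partner is the same as their total count, which matches how the Repeats column is defined above. For the real file, pass the file name instead of the `io.BytesIO` object.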
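I have a utility, developed for a client who could not process the very large XML documents they were being sent, that extracts selected information into a delimited file. Although those XML documents are huge, their structure is simple, with no repeats below the level 2 elements. I wondered whether I could enhance the utility to accept repeats and output the data to a delimited file, albeit a non-standard one. I now know that I can, although I am not sure how useful the resulting file is.
My output has 97 columns, one per leaf element. There are six header rows, one per level, which list each leaf element and its ancestors. When an element repeats, each further value is placed on the next available row of that column. I hope the following few columns from the rows for the first three drugs make this clear. Note that column 61 is shown truncated.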
|Column 1 |Column 2 |Column 18 |Column 25 |Column 56 |Column 60 |Column 61 |Column 62 |
|drugs |drugs |drugs |drugs |drugs |drugs |drugs |drugs |
|drug |drug |drug |drug |drug |drug |drug |drug |
|drugbank-id|name |secondary-accession-numbers|mixtures |external-identifiers |targets |targets |targets |
| | |secondary-accession-number |mixture |external-identifier |target |target |target |
| | | |name |resource |actions |references |known-action|
| | | | | |action | | |
|DB00001 |Lepirudin |BIOD00024 | |Drugs Product Database (DPD)|inhibitor |# Turpie AG: Anticoagulants in|yes |
| | |BTD00024 | |National Drug Code Directory| | | |
| | | | |PharmGKB | | | |
| | | | |UniProtKB | | | |
|DB00002 |Cetuximab |BIOD00071 | |National Drug Code Directory|antagonist|# Hosokawa N, Yamamoto S, Ueha|yes |
| | |BTD00071 | |GenBank | |# Snyder LC, Astsaturov I, Wei|unknown |
| | | | |PharmGKB | |# Overington JP, Al-Lazikani B|unknown |
| | | | | | |# Overington JP, Al-Lazikani B|unknown |
| | | | | | |# Overington JP, Al-Lazikani B|unknown |
| | | | | | |# Overington JP, Al-Lazikani B|unknown |
| | | | | | |# Overington JP, Al-Lazikani B|unknown |
| | | | | | |# Overington JP, Al-Lazikani B|unknown |
| | | | | | |# Negri DR, Tosi E, Valota O, |unknown |
| | | | | | |# Overington JP, Al-Lazikani B|unknown |
| | | | | | |# Overington JP, Al-Lazikani B|unknown |
| | | | | | |# Overington JP, Al-Lazikani B|unknown |
|DB00003 |Dornase Alfa|BIOD00001 |Cauterex |Drugs Product Database (DPD)| |# Cramer GW, Bosso JA: The rol|yes |
| | |BTD00001 |Clorfibrase|GenBank | | | |
| | | |Elase |PharmGKB | | | |
| | | |Fibrabene |UniProtKB | | | |
| | | |Fibrase SA | | | | |
| | | |Fibrolan | | | | |
| | | |Parkelase | | | | |
| | | |Ridasa | | | | |
| | | | | | | | |
The resulting file has 135,713 rows and is 52,171,387 bytes long. Would this, or some simple variation of it, be useful?
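For readers who want to experiment with this layout, here is a minimal Python sketch of the same idea: one column per leaf element, one header row per level, and repeated values stacked on the next available row. It is not the utility described above. All function names and the sample document are hypothetical, the header rows start at the record level rather than at the root, and the whole document is held in memory, so the full drugbank.xml might need a streaming parser instead.

```python
import csv
import sys
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop a '{namespace}' prefix, if the file declares one."""
    return tag.rsplit("}", 1)[-1]


def leaf_values(record):
    """Map each childless element's path under `record` to its texts, in order."""
    values = {}

    def walk(elem, path):
        children = list(elem)
        if not children:
            values.setdefault(path, []).append((elem.text or "").strip())
        for child in children:
            walk(child, path + (local_name(child.tag),))

    walk(record, (local_name(record.tag),))
    return values


def discover_columns(records):
    """Collect leaf paths in first-seen order, dropping empty containers."""
    seen = {}
    for record in records:
        for path in leaf_values(record):
            seen.setdefault(path, None)
    paths = list(seen)
    # A path that is a proper prefix of another is a container, not a leaf.
    return [p for p in paths
            if not any(len(q) > len(p) and q[:len(p)] == p for q in paths)]


def header_rows(columns):
    """One header row per level; shorter paths are padded with blanks."""
    depth = max(len(c) for c in columns)
    return [[c[i] if i < len(c) else "" for c in columns] for i in range(depth)]


def flatten(records, columns):
    """Yield data rows; a record is as tall as its most-repeated leaf."""
    for record in records:
        values = leaf_values(record)
        height = max((len(values.get(c, [])) for c in columns), default=0)
        for i in range(max(height, 1)):
            yield [values[c][i] if i < len(values.get(c, [])) else ""
                   for c in columns]


# Hypothetical miniature document; the ids and names are invented.
SAMPLE = """<drugs>
  <drug><drugbank-id>DB1</drugbank-id><name>A</name><mixtures/>
    <groups><group>x</group><group>y</group></groups></drug>
  <drug><drugbank-id>DB2</drugbank-id><name>B</name>
    <mixtures><mixture><name>M</name></mixture></mixtures>
    <groups><group>z</group></groups></drug>
</drugs>"""

records = list(ET.fromstring(SAMPLE))
columns = discover_columns(records)
writer = csv.writer(sys.stdout, delimiter="\t", lineterminator="\n")
writer.writerows(header_rows(columns))
writer.writerows(flatten(records, columns))
```

As in the extract above, each column is stacked independently, so values that share a row within one record are not necessarily related to each other. That is the main reason a file like this is non-standard and needs care when it is sorted or filtered in a spreadsheet.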