I have written some code to split a 12 GB file on the tag "", based on this Stack Overflow answer:
Splitting XML file into multiple at given tags
import xml.etree.ElementTree as ET
context = ET.iterparse('data.xml', events=('end', ))
for event, elem in context:
    if elem.tag == 'row':
        title = elem.find('</Seq-entry>').text
        filename = format(title + ".xml")
        with open(filename, 'wb') as f:
            f.write(ET.tostring(elem))
But it fails with a strange error:
Traceback (most recent call last):
  File "s.py", line 3, in <module>
    for event, elem in context:
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 1271, in next
    raise e
xml.etree.ElementTree.ParseError: syntax error: line 1, column 0
How can I fix this error? Please help.
Thanks
Update
Contents of data.xml:
<Seq-entry>
</User-field>
<User-field>
<User-field>
<User-field_label>
<Object-id>
<Object-id_str>Pseudo Genes (incomplete)</Object-id_str>
</Object-id>
</User-field_label>
<User-field_data>
<User-field_data_str>100 of 262</User-field_data_str>
<User-field_data>
<User-field_data_str>16 of 262</User-field_data_str>
</User-field_data>
</User-field>
</User-field>
<User-field>
<User-field_label>
<Object-id>
<Object-id_str>StructuredCommentSuffix</Object-id_str>
</Object-id>
</User-field_label>
<User-field_data>
<Object-id>
<Object-id_str>FeatureFetchPolicy</Object-id_str>
</Object-id>
<User-field_data>
<User-field_data_str>OnlyNearFeatures</User-field_data_str>
</User-field_data>
</User-field>
</User-object_data>
</User-object>
</Seqdesc_user>
</Seqdesc>
<Seqdesc>
<Seqdesc_user>
<User-object>
<User-object_type>
</User-field_label>
<User-field_num>1</User-field_num>
<Object-id_str>Assembly</Object-id_str>
</Object-id>
</User-field_label>
<User-field_num>1</User-field_num>
<User-field_data>
<User-field_data_strs>
<Object-id>
<Object-id_str>StructuredComment</Object-id_str>
</Object-id>
</User-object_type>
<User-object_data>
<User-field>
<User-field_label>
<Object-id>
<Object-id_str>StructuredCommentPrefix</Object-id_str>
</Object-id>
</User-field_label>
<User-field_data>
<User-field_data_str>##Genome-Assembly-Data-START##</User-field_data_str>
</User-field_data>
</User-field>
<User-field>
<User-field_label>
<Object-id>
<Object-id_str>Assembly Method</Object-id_str>
</Object-id>
</User-field_label>
<User-field_data>
<User-field_data_str>Newbler v. 2.6</User-field_data_str>
</User-field_data>
</User-field>
<User-field>
<User-field_label>
<Object-id>
<Object-id_str>Genome Representation</Object-id_str>
</Object-id>
</User-field_label>
<User-field_data>
<User-field_data_str>Full</User-field_data_str>
</User-field_data>
</User-field>
<User-field>
<User-field_label>
<Object-id>
<Object-id_str>Expected Final Version</Object-id_str>
</User-field_label>
<User-field_data>
<User-field_data_str>53.0x</User-field_data_str>
</User-field_data>
</User-field>
<User-field>
<User-field_label>
<Object-id>
<Object-id_str>Sequencing Technology</Object-id_str>
</Object-id>
</User-field_label>
<User-field_data>
<User-field_data_str>IonTorrent</User-field_data_str>
</User-field_data>
</User-field>
<User-field>
<User-field_label>
<Object-id>
<Object-id_str>StructuredCommentSuffix</Object-id_str>
</Object-id>
</User-field_label>
<User-field_data>
<User-field_data_str>##Genome-Assembly-Data-END##</User-field_data_str>
</User-field_data>
</User-field>
</User-object_data>
</User-object>
</Seqdesc_user>
</Seqdesc>
</Person-id_name>
</Person-id>
</Author_name>
</Author>
<Author>
<Author_name>
<Person-id>
<Person-id_name>
<Name-std>
<Name-std_last>Krasnov</Name-std_last>
<Name-std_first>Ya</Name-std_first>
<Name-std_initials>Y.M.</Name-std_initials>
</Name-std>
</Person-id_name>
</Person-id>
</Author_name>
</Author>
<Author>
<Author_name>
<Person-id>
<Person-id_name>
<Name-std>
<Name-std_last>Alkhova</Name-std_last>
<Name-std_first>Zh</Name-std_first>
<Name-std_initials>Z.V.</Name-std_initials>
</Name-std>
</Person-id_name>
</Person-id>
</Author_name>
</Author>
<Author>
<Author_name>
<Person-id>
<Person-id_name>
<Name-std>
<Name-std_last>Shchelkanova</Name-std_last>
<Name-std_first>E</Name-std_first>
<Name-std_initials>E.Y.</Name-std_initials>
</Name-std>
</Person-id_name>
</Person-id>
</Author_name>
</Author>
<Author>
<Author_name>
<Person-id>
<Person-id_name>
<Name-std>
<Name-std_last>Smirnova</Name-std_last>
<Name-std_first>N</Name-std_first>
<Name-std_initials>N.I.</Name-std_initials>
</Name-std>
</Person-id_name>
</Person-id>
</Author_name>
</Author>
<Author>
<Author_name>
<Person-id>
<Person-id_name>
<Name-std>
<Name-std_last>Kutyrev</Name-std_last>
<Name-std_first>V</Name-std_first>
<Name-std_initials>V.</Name-std_initials>
</Name-std>
</Person-id_name>
</Person-id>
</Author_name>
</Author>
</Auth-list_names_std>
</Auth-list_names>
<Auth-list_affil>
<Affil>
<Affil_std>
<Affil_std_affil>RARI</Affil_std_affil>
<Affil_std_div>Mikrobiologie</Affil_std_div>
<Affil_std_city>Saratov</Affil_std_city>
<Affil_std_sub>Saratov region</Affil_std_sub>
<Affil_std_country>Russian Federation</Affil_std_country>
<Affil_std_street>Universitetskaya 46</Affil_std_street>
<Affil_std_postal-code>410005</Affil_std_postal-code>
</Affil_std>
</Affil>
</Auth-list_affil>
</Auth-list>
</Cit-sub_authors>
<Cit-sub_date>
<Date>
<Date_std>
<Date-std>
<Date-std_year>2017</Date-std_year>
<Date-std_month>3</Date-std_month>
<Date-std_day>1</Date-std_day>
</Date-std>
</Date_std>
</Date>
</Cit-sub_date>
</Cit-sub>
</Pub_sub>
</Pub>
</Pub-equiv>
</Pubdesc_pub>
</Pubdesc>
</Seqdesc_pub>
</Seqdesc>
<Seqdesc>
<Seqdesc_pub>
<Pubdesc>
<Pubdesc_pub>
<Pub-equiv>
<Pub>
<Pub_gen>
<Cit-gen>
<Cit-gen_cit>Unpublished</Cit-gen_cit>
<Cit-gen_authors>
<Auth-list>
<Auth-list_names>
<Auth-list_names_std>
<Author>
<Author_name>
<Person-id>
<Person-id_name>
<Name-std>
<Name-std_last>Agafonova</Name-std_last>
<Name-std_first>E</Name-std_first>
<Name-std_initials>E.Y.</Name-std_initials>
</Name-std>
</Person-id_name>
</Person-id>
</Author_name>
</Author>
<Author>
<Author_name>
<Person-id>
<Person-id_name>
<Name-std>
<Name-std_last>Krasnov</Name-std_last>
<Name-std_first>Ya</Name-std_first>
<Name-std_initials>Y.M.</Name-std_initials>
</Name-std>
</Person-id_name>
</Person-id>
</Author_name>
</Author>
<Author>
<Author_name>
<Person-id>
<Person-id_name>
<Name-std>
<Name-std_last>Alkhova</Name-std_last>
<Name-std_first>Zh</Name-std_first>
<Name-std_initials>Z.V.</Name-std_initials>
</Name-std>
</Person-id_name>
</Person-id>
</Author_name>
</Author>
<Author>
<Author_name>
<Person-id>
<Person-id_name>
<Name-std>
<Name-std_last>Shchelkanova</Name-std_last>
<Name-std_first>E</Name-std_first>
<Name-std_initials>E.Y.</Name-std_initials>
</Name-std>
</Person-id_name>
</Person-id>
</Author_name>
</Author>
<Author>
<Author_name>
<Person-id>
<Person-id_name>
<Name-std>
<Name-std_last>Smirnova</Name-std_last>
<Name-std_first>N</Name-std_first>
<Name-std_initials>N.I.</Name-std_initials>
</Name-std>
</Person-id_name>
</Person-id>
</Author_name>
</Author>
<Author>
<Author_name>
<Person-id>
<Person-id_name>
<Name-std>
<Name-std_last>Kutyrev</Name-std_last>
<Name-std_first>V</Name-std_first>
<Name-std_initials>V.</Name-std_initials>
</Name-std>
</Person-id_name>
</Person-id>
</Author_name>
</Author>
</Auth-list_names_std>
</Auth-list_names>
</Auth-list>
</Cit-gen_authors>
<Cit-gen_title>The outbreak of cholera in Mariupol in 2011</Cit-gen_title>
</Cit-gen>
</Pub_gen>
</Pub>
</Pub-equiv>
</Pubdesc_pub>
</Pubdesc>
</Seqdesc_pub>
</Seqdesc>
<Seqdesc>
<Seqdesc_source>
<BioSource>
<BioSource_genome value="genomic">1</BioSource_genome>
<BioSource_org>
<Org-ref>
<Org-ref_taxname>Vibrio cholerae</Org-ref_taxname>
<Org-ref_db>
<Dbtag>
<Dbtag_db>taxon</Dbtag_db>
<Dbtag_tag>
<Object-id>
<Object-id_id>666</Object-id_id>
</Object-id>
</Dbtag_tag>
</Dbtag>
</Org-ref_db>
<Org-ref_orgname>
<OrgName>
<OrgName_name>
<OrgName_name_binomial>
<BinomialOrgName>
<BinomialOrgName_genus>Vibrio</BinomialOrgName_genus>
<BinomialOrgName_species>cholerae</BinomialOrgName_species>
</BinomialOrgName>
</OrgName_name_binomial>
</OrgName_name>
<User-field_data_str>MWRE01000175</User-field_data_str>
</User-field_data>
</User-field>
<User-field>
<User-field_label>
<Object-id>
<Object-id_str>gi</Object-id_str>
</Object-id>
</User-field_label>
<User-field_data>
<User-field_data_int>1208991974</User-field_data_int>
</User-field_data>
<User-field>
<User-field_label>
<Object-id>
<Object-id_str>Status</Object-id_str>
</Object-id>
</User-field_label>
<User-field_data>
<User-field_data_str>pipeline</User-field_data_str>
</User-field_data>
</Object-id>
</User-object_type>
<User-object_data>
<User-field>
<User-field_label>
<Object-id>
<Object-id_str>Policy</Object-id_str>
</Object-id>
</User-field_label>
<User-field_data>
<User-field_data_str>OnlyNearFeatures</User-field_data_str>
</User-field_data>
</User-field>
</User-object_data>
</User-object>
</Seqdesc_user>
</Seqdesc>
<Seqdesc>
<Seqdesc_update-date>
<Date>
<Date_std>
<Date-std>
<Date-std_year>2017</Date-std_year>
<Date-std_month>6</Date-std_month>
<Date-std_day>24</Date-std_day>
</Date-std>
</Date_std>
</Date>
</Seqdesc_update-date>
</Seqdesc>
<Seqdesc>
<Seq-interval_strand>
<Na-strand value="plus"/>
</Seq-interval_strand>
<Seq-interval_id>
<Seq-id>
<Feat-id_local>
<Object-id>
<Object-id_id>11567</Object-id_id>
</Object-id>
</Feat-id_local>
</Feat-id>
</Seq-feat_id>
<Seq-feat_data>
<SeqFeatData>
</Seq-entry>
<SeqFeatData_gene>
<Gene-ref>
<Gene-ref_locus-tag>B2J70_RS19360</Gene-ref_locus-tag>
</Gene-ref>
</SeqFeatData_gene>
</SeqFeatData>
</Seq-feat_data>
<Seq-feat_location>
<Seq-loc>
<Seq-loc_int>
<Object-id>
<Object-id_str>ModelEvidence</Object-id_str>
</Object-id>
</User-object_type>
<User-object_data>
<User-field>
</Seq-entry>
<User-field_label>
<Object-id>
<Object-id_str>Method</Object-id_str>
</Feat-id>
</Seq-feat_id>
<Seq-feat_data>
<SeqFeatData>
<SeqFeatData_gene>
<Gene-ref>
<Gene-ref_locus-tag>B2J70_RS19365</Gene-ref_locus-tag>
</Gene-ref>
</SeqFeatData_gene>
</SeqFeatData>
</Seq-feat_data>
<Seq-feat_location>
<Seq-loc>
</Seq-entry>
<Trna-ext>
<Trna-ext_aa>
<Trna-ext_aa_ncbieaa>81</Trna-ext_aa_ncbieaa>
</Trna-ext_aa>
<Trna-ext_anticodon>
<Seq-loc>
<Seq-loc_int>
<Seq-interval>
<Seq-interval_from>157</Seq-interval_from>
<Seq-interval_to>159</Seq-interval_to>
<Seq-interval_strand>
<Na-strand value="plus"/>
</Seq-interval_strand>
<Seq-interval_id>
<Seq-id>
<Seq-id_gi>1209940906</Seq-id_gi>
</Seq-id>
</Seq-entry>
</Seq-interval_id>
</Seq-interval>
</Seq-loc_int>
</RNA-ref_ext_tRNA>
</Gb-qual>
</Seq-feat_qual>
<Seq-feat_exts>
</Seq-entry>
<User-object>
<User-object_type>
<Object-id>
<Object-id_str>ModelEvidence</Object-id_str>
</Object-id>
</User-object_type>
</Seq-entry>
<User-object_data>
<User-field>
<User-field_label>
<Object-id>
</Seq-entry>
</Seq-annot>
</Bioseq_annot>
</Bioseq>
</Seq-entry_seq>
</Seq-entry>
I created this file from the original file and added the tag "" at random positions.
Answer 0 (score: 1)
First of all, your data.xml is not valid XML, so please make it valid XML first. I tried this with a small part of the data.
test.xml (used for testing):
<Seq-entry>
<User-field>
<User-field_data>
<User-field_data_str>100 of 262</User-field_data_str>
</User-field_data>
<User-field_data>
<User-field_data_str>16 of 262</User-field_data_str>
</User-field_data>
</User-field>
</Seq-entry>
test.py:
import xml.etree.ElementTree as ET

# Stream the file and handle each element once its closing tag has been read.
context = ET.iterparse('test.xml', events=('end',))
for event, elem in context:
    if elem.tag == 'User-field_data':
        # Name the output file after the text of the child element.
        title = elem.find('User-field_data_str').text
        filename = title + ".xml"
        with open(filename, 'wb') as f:
            f.write(ET.tostring(elem))
This way you can split the XML at whichever element you need.
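For a file as large as the 12 GB one in the question, memory use and file naming may also matter. The following is only a sketch under a few assumptions, not a tested solution for the real data: the helper name split_by_tag, the numbered output file names, and the tiny in-line sample are all made up for illustration. It assumes that elements with the chosen tag are not nested inside each other and that the file uses no XML namespaces. Clearing each element after it is written reduces memory growth, although the parent elements still keep the emptied child nodes.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def split_by_tag(xml_path, tag, out_dir):
    """Write every <tag> element of xml_path to its own numbered file.

    Returns the list of files written. Raises ET.ParseError if the
    input is not well-formed XML.
    """
    written = []
    for _event, elem in ET.iterparse(xml_path, events=("end",)):
        if elem.tag != tag:
            continue
        # Number the files instead of naming them after element text,
        # which may contain characters that are unsafe in file names.
        name = "%s_%06d.xml" % (tag, len(written) + 1)
        out_path = os.path.join(out_dir, name)
        with open(out_path, "wb") as f:
            f.write(ET.tostring(elem))
        written.append(out_path)
        # Drop the element's children and text so they can be freed.
        elem.clear()
    return written


if __name__ == "__main__":
    # Tiny stand-in for the real file, modelled on the test.xml sample.
    sample = (
        b"<Seq-entry><User-field>"
        b"<User-field_data><User-field_data_str>100 of 262"
        b"</User-field_data_str></User-field_data>"
        b"<User-field_data><User-field_data_str>16 of 262"
        b"</User-field_data_str></User-field_data>"
        b"</User-field></Seq-entry>"
    )
    work_dir = tempfile.mkdtemp()
    src = os.path.join(work_dir, "test.xml")
    with open(src, "wb") as f:
        f.write(sample)
    try:
        for path in split_by_tag(src, "User-field_data", work_dir):
            print(path)
    except ET.ParseError as err:
        # err.position is a (line, column) tuple pointing at the problem.
        print("not well-formed at line %d, column %d" % err.position)
```

If the input is not well-formed, as with the data.xml shown in the question, the ParseError raised by iterparse carries the line and column where parsing stopped, which can help locate the first broken spot before attempting the split.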