Splitting a large file based on a given tag

Asked: 2017-09-20 11:31:52

Tags: python python-3.x

I have written code to split a 12 GB file at the tag "". It is based on this Stack Overflow answer:

Splitting XML file into multiple at given tags

import xml.etree.ElementTree as ET
context = ET.iterparse('data.xml', events=('end', ))
for event, elem in context:
    if elem.tag == 'row':
        title = elem.find('</Seq-entry>').text
        filename = format(title + ".xml")
        with open(filename, 'wb') as f:
            f.write(ET.tostring(elem))

But it fails with a strange error:

Traceback (most recent call last):
  File "s.py", line 3, in <module>
    for event, elem in context:
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 1271, in next
    raise e
xml.etree.ElementTree.ParseError: syntax error: line 1, column 0

How can I fix this error? Please help.

Thanks

Update

data.xml:

  <Seq-entry>
                  </User-field>
                  <User-field>
                  <User-field>
                    <User-field_label>
                      <Object-id>
                        <Object-id_str>Pseudo Genes (incomplete)</Object-id_str>
                      </Object-id>
                    </User-field_label>
                    <User-field_data>
                      <User-field_data_str>100 of 262</User-field_data_str>
                    <User-field_data>
                      <User-field_data_str>16 of 262</User-field_data_str>
                    </User-field_data>
                  </User-field>
                  </User-field>
                  <User-field>
                    <User-field_label>
                      <Object-id>
                        <Object-id_str>StructuredCommentSuffix</Object-id_str>
                      </Object-id>
                    </User-field_label>
                    <User-field_data>
                  <Object-id>
                    <Object-id_str>FeatureFetchPolicy</Object-id_str>
                  </Object-id>
                    <User-field_data>
                      <User-field_data_str>OnlyNearFeatures</User-field_data_str>
                    </User-field_data>
                  </User-field>
                </User-object_data>
              </User-object>
            </Seqdesc_user>
          </Seqdesc>
          <Seqdesc>
            <Seqdesc_user>
              <User-object>
                <User-object_type>
                    </User-field_label>
                    <User-field_num>1</User-field_num>
                        <Object-id_str>Assembly</Object-id_str>
                      </Object-id>
                    </User-field_label>
                    <User-field_num>1</User-field_num>
                    <User-field_data>
                      <User-field_data_strs>
                   <Object-id>
                    <Object-id_str>StructuredComment</Object-id_str>
                  </Object-id>
                </User-object_type>
                <User-object_data>
                  <User-field>
                    <User-field_label>
                      <Object-id>
                        <Object-id_str>StructuredCommentPrefix</Object-id_str>
                      </Object-id>
                    </User-field_label>
                    <User-field_data>
                      <User-field_data_str>##Genome-Assembly-Data-START##</User-field_data_str>
                    </User-field_data>
                  </User-field>
                  <User-field>
                    <User-field_label>
                      <Object-id>
                        <Object-id_str>Assembly Method</Object-id_str>
                      </Object-id>
                    </User-field_label>
                    <User-field_data>
                      <User-field_data_str>Newbler v. 2.6</User-field_data_str>
                    </User-field_data>
                  </User-field>
                  <User-field>
                    <User-field_label>
                      <Object-id>
                        <Object-id_str>Genome Representation</Object-id_str>
                      </Object-id>
                    </User-field_label>
                    <User-field_data>
                      <User-field_data_str>Full</User-field_data_str>
                    </User-field_data>
                  </User-field>
                  <User-field>
                    <User-field_label>
                      <Object-id>
                        <Object-id_str>Expected Final Version</Object-id_str>              
                    </User-field_label>
                    <User-field_data>
                      <User-field_data_str>53.0x</User-field_data_str>
                    </User-field_data>
                  </User-field>
                  <User-field>
                    <User-field_label>
                      <Object-id>
                        <Object-id_str>Sequencing Technology</Object-id_str>
                      </Object-id>
                    </User-field_label>
                    <User-field_data>
                      <User-field_data_str>IonTorrent</User-field_data_str>
                    </User-field_data>
                  </User-field>
                  <User-field>
                    <User-field_label>
                      <Object-id>
                        <Object-id_str>StructuredCommentSuffix</Object-id_str>
                      </Object-id>
                    </User-field_label>
                    <User-field_data>
                      <User-field_data_str>##Genome-Assembly-Data-END##</User-field_data_str>
                      </User-field_data>
                         </User-field>
                            </User-object_data>
                               </User-object>
                                  </Seqdesc_user>
                                     </Seqdesc>
                                        </Person-id_name>
                                      </Person-id>
                                    </Author_name>
                                  </Author>
                                  <Author>
                                    <Author_name>
                                      <Person-id>
                                        <Person-id_name>
                                          <Name-std>
                                            <Name-std_last>Krasnov</Name-std_last>
                                            <Name-std_first>Ya</Name-std_first>
                                            <Name-std_initials>Y.M.</Name-std_initials>
                                          </Name-std>
                                        </Person-id_name>
                                      </Person-id>
                                    </Author_name>
                                  </Author>
                                  <Author>
                                    <Author_name>
                                      <Person-id>
                                        <Person-id_name>
                                          <Name-std>
                                            <Name-std_last>Alkhova</Name-std_last>
                                            <Name-std_first>Zh</Name-std_first>
                                            <Name-std_initials>Z.V.</Name-std_initials>
                                          </Name-std>
                                        </Person-id_name>
                                      </Person-id>
                                    </Author_name>
                                  </Author>
                                  <Author>
                                    <Author_name>
                                      <Person-id>
                                        <Person-id_name>
                                          <Name-std>
                                            <Name-std_last>Shchelkanova</Name-std_last>
                                            <Name-std_first>E</Name-std_first>
                                            <Name-std_initials>E.Y.</Name-std_initials>
                                          </Name-std>
                                        </Person-id_name>
                                      </Person-id>
                                    </Author_name>
                                  </Author>
                                  <Author>
                                    <Author_name>
                                      <Person-id>
                                        <Person-id_name>
                                          <Name-std>
                                            <Name-std_last>Smirnova</Name-std_last>
                                            <Name-std_first>N</Name-std_first>
                                            <Name-std_initials>N.I.</Name-std_initials>
                                          </Name-std>
                                        </Person-id_name>
                                      </Person-id>
                                    </Author_name>
                                  </Author>
                                  <Author>
                                    <Author_name>
                                      <Person-id>
                                        <Person-id_name>
                                          <Name-std>
                                            <Name-std_last>Kutyrev</Name-std_last>
                                            <Name-std_first>V</Name-std_first>
                                            <Name-std_initials>V.</Name-std_initials>
                                          </Name-std>
                                        </Person-id_name>
                                      </Person-id>
                                    </Author_name>
                                  </Author>
                                </Auth-list_names_std>
                              </Auth-list_names>
                              <Auth-list_affil>
                                <Affil>
                                  <Affil_std>
                                    <Affil_std_affil>RARI</Affil_std_affil>
                                    <Affil_std_div>Mikrobiologie</Affil_std_div>
                                    <Affil_std_city>Saratov</Affil_std_city>
                                    <Affil_std_sub>Saratov region</Affil_std_sub>
                                    <Affil_std_country>Russian Federation</Affil_std_country>
                                    <Affil_std_street>Universitetskaya 46</Affil_std_street>
                                    <Affil_std_postal-code>410005</Affil_std_postal-code>
                                  </Affil_std>
                                </Affil>
                              </Auth-list_affil>
                            </Auth-list>
                          </Cit-sub_authors>
                          <Cit-sub_date>
                            <Date>
                              <Date_std>
                                <Date-std>
                                  <Date-std_year>2017</Date-std_year>
                                  <Date-std_month>3</Date-std_month>
                                  <Date-std_day>1</Date-std_day>
                                </Date-std>
                              </Date_std>
                            </Date>
                          </Cit-sub_date>
                        </Cit-sub>
                      </Pub_sub>
                    </Pub>
                  </Pub-equiv>
                </Pubdesc_pub>
              </Pubdesc>
            </Seqdesc_pub>
          </Seqdesc>
          <Seqdesc>
            <Seqdesc_pub>
              <Pubdesc>
                <Pubdesc_pub>
                  <Pub-equiv>
                    <Pub>
                      <Pub_gen>
                        <Cit-gen>
                          <Cit-gen_cit>Unpublished</Cit-gen_cit>
                          <Cit-gen_authors>
                            <Auth-list>
                              <Auth-list_names>
                                <Auth-list_names_std>
                                  <Author>
                                    <Author_name>
                                      <Person-id>
                                        <Person-id_name>
                                          <Name-std>
                                            <Name-std_last>Agafonova</Name-std_last>
                                            <Name-std_first>E</Name-std_first>
                                            <Name-std_initials>E.Y.</Name-std_initials>
                                          </Name-std>
                                        </Person-id_name>
                                      </Person-id>
                                    </Author_name>
                                  </Author>
                                  <Author>
                                    <Author_name>
                                      <Person-id>
                                        <Person-id_name>
                                          <Name-std>
                                            <Name-std_last>Krasnov</Name-std_last>
                                            <Name-std_first>Ya</Name-std_first>
                                            <Name-std_initials>Y.M.</Name-std_initials>
                                          </Name-std>
                                        </Person-id_name>
                                      </Person-id>
                                    </Author_name>
                                  </Author>
                                  <Author>
                                    <Author_name>
                                      <Person-id>
                                        <Person-id_name>
                                          <Name-std>
                                            <Name-std_last>Alkhova</Name-std_last>
                                            <Name-std_first>Zh</Name-std_first>
                                            <Name-std_initials>Z.V.</Name-std_initials>
                                          </Name-std>
                                        </Person-id_name>
                                      </Person-id>
                                    </Author_name>
                                  </Author>
                                  <Author>
                                    <Author_name>
                                      <Person-id>
                                        <Person-id_name>
                                          <Name-std>
                                            <Name-std_last>Shchelkanova</Name-std_last>
                                            <Name-std_first>E</Name-std_first>
                                            <Name-std_initials>E.Y.</Name-std_initials>
                                          </Name-std>
                                        </Person-id_name>
                                      </Person-id>
                                    </Author_name>
                                  </Author>
                                  <Author>
                                    <Author_name>
                                      <Person-id>
                                        <Person-id_name>
                                          <Name-std>
                                            <Name-std_last>Smirnova</Name-std_last>
                                            <Name-std_first>N</Name-std_first>
                                            <Name-std_initials>N.I.</Name-std_initials>
                                          </Name-std>
                                        </Person-id_name>
                                      </Person-id>
                                    </Author_name>
                                  </Author>
                                  <Author>
                                    <Author_name>
                                      <Person-id>
                                        <Person-id_name>
                                          <Name-std>
                                            <Name-std_last>Kutyrev</Name-std_last>
                                            <Name-std_first>V</Name-std_first>
                                            <Name-std_initials>V.</Name-std_initials>
                                          </Name-std>
                                        </Person-id_name>
                                      </Person-id>
                                    </Author_name>
                                  </Author>
                                </Auth-list_names_std>
                              </Auth-list_names>
                            </Auth-list>
                          </Cit-gen_authors>
                          <Cit-gen_title>The outbreak of cholera in Mariupol in 2011</Cit-gen_title>
                        </Cit-gen>
                      </Pub_gen>
                    </Pub>
                  </Pub-equiv>
                </Pubdesc_pub>
              </Pubdesc>
            </Seqdesc_pub>
          </Seqdesc>
          <Seqdesc>
            <Seqdesc_source>
              <BioSource>
                <BioSource_genome value="genomic">1</BioSource_genome>
                <BioSource_org>
                  <Org-ref>
                    <Org-ref_taxname>Vibrio cholerae</Org-ref_taxname>
                    <Org-ref_db>
                      <Dbtag>
                        <Dbtag_db>taxon</Dbtag_db>
                        <Dbtag_tag>
                          <Object-id>
                            <Object-id_id>666</Object-id_id>
                          </Object-id>
                        </Dbtag_tag>
                      </Dbtag>
                    </Org-ref_db>
                    <Org-ref_orgname>
                      <OrgName>
                        <OrgName_name>
                          <OrgName_name_binomial>
                            <BinomialOrgName>
                              <BinomialOrgName_genus>Vibrio</BinomialOrgName_genus>
                              <BinomialOrgName_species>cholerae</BinomialOrgName_species>
                            </BinomialOrgName>
                          </OrgName_name_binomial>
                        </OrgName_name>
                                  <User-field_data_str>MWRE01000175</User-field_data_str>
                                </User-field_data>
                              </User-field>
                              <User-field>
                                <User-field_label>
                                  <Object-id>
                                    <Object-id_str>gi</Object-id_str>
                                  </Object-id>
                                </User-field_label>
                                <User-field_data>
                                  <User-field_data_int>1208991974</User-field_data_int>
                                </User-field_data>
                  <User-field>
                    <User-field_label>
                      <Object-id>
                        <Object-id_str>Status</Object-id_str>
                      </Object-id>
                    </User-field_label>
                    <User-field_data>
                      <User-field_data_str>pipeline</User-field_data_str>
                    </User-field_data>
                  </Object-id>
                </User-object_type>
                <User-object_data>
                  <User-field>
                    <User-field_label>
                      <Object-id>
                        <Object-id_str>Policy</Object-id_str>
                      </Object-id>
                    </User-field_label>
                    <User-field_data>
                      <User-field_data_str>OnlyNearFeatures</User-field_data_str>
                    </User-field_data>
                  </User-field>
                </User-object_data>
              </User-object>
            </Seqdesc_user>
          </Seqdesc>
          <Seqdesc>
            <Seqdesc_update-date>
              <Date>
                <Date_std>
                  <Date-std>
                    <Date-std_year>2017</Date-std_year>
                    <Date-std_month>6</Date-std_month>
                    <Date-std_day>24</Date-std_day>
                  </Date-std>
                </Date_std>
              </Date>
            </Seqdesc_update-date>
          </Seqdesc>
          <Seqdesc>
                            <Seq-interval_strand>
                              <Na-strand value="plus"/>
                            </Seq-interval_strand>
                            <Seq-interval_id>
                              <Seq-id>
                    <Feat-id_local>
                      <Object-id>
                        <Object-id_id>11567</Object-id_id>
                      </Object-id>
                    </Feat-id_local>
                  </Feat-id>
                </Seq-feat_id>
                <Seq-feat_data>
                  <SeqFeatData>
                  </Seq-entry>
                    <SeqFeatData_gene>
                      <Gene-ref>
                        <Gene-ref_locus-tag>B2J70_RS19360</Gene-ref_locus-tag>
                      </Gene-ref>
                    </SeqFeatData_gene>
                  </SeqFeatData>
                </Seq-feat_data>
                <Seq-feat_location>
                  <Seq-loc>
                    <Seq-loc_int>
                      <Object-id>
                        <Object-id_str>ModelEvidence</Object-id_str>
                      </Object-id>
                    </User-object_type>
                    <User-object_data>
                      <User-field>
                     </Seq-entry>
                        <User-field_label>
                          <Object-id>
                            <Object-id_str>Method</Object-id_str>
                  </Feat-id>
                </Seq-feat_id>
                <Seq-feat_data>
                  <SeqFeatData>
                    <SeqFeatData_gene>
                      <Gene-ref>
                        <Gene-ref_locus-tag>B2J70_RS19365</Gene-ref_locus-tag>
                      </Gene-ref>
                    </SeqFeatData_gene>
                  </SeqFeatData>
                </Seq-feat_data>
                <Seq-feat_location>
                  <Seq-loc>
                         </Seq-entry> 
                             <Trna-ext>
                              <Trna-ext_aa>
                                <Trna-ext_aa_ncbieaa>81</Trna-ext_aa_ncbieaa>
                              </Trna-ext_aa>
                              <Trna-ext_anticodon>
                                <Seq-loc>
                                  <Seq-loc_int>
                                    <Seq-interval>
                                      <Seq-interval_from>157</Seq-interval_from>
                                      <Seq-interval_to>159</Seq-interval_to>
                                      <Seq-interval_strand>
                                        <Na-strand value="plus"/>
                                      </Seq-interval_strand>
                                      <Seq-interval_id>
                                        <Seq-id>
                                          <Seq-id_gi>1209940906</Seq-id_gi>
                                        </Seq-id>
                                      </Seq-entry>
                                      </Seq-interval_id>
                                    </Seq-interval>
                                  </Seq-loc_int>
                          </RNA-ref_ext_tRNA>
                  </Gb-qual>
                </Seq-feat_qual>
                <Seq-feat_exts>
                </Seq-entry>
                  <User-object>
                    <User-object_type>
                      <Object-id>
                        <Object-id_str>ModelEvidence</Object-id_str>
                      </Object-id>
                    </User-object_type>
                    </Seq-entry>
                    <User-object_data>
                      <User-field>
                        <User-field_label>
                          <Object-id>
                         </Seq-entry>
        </Seq-annot>
      </Bioseq_annot>
    </Bioseq>
  </Seq-entry_seq>
</Seq-entry>

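I created this file from the original file and added the tag "" at random positions.

1 answer:

Answer 0 (score: 1)

First of all, your data.xml is not valid XML, so please turn it into valid XML first. I tried the approach on a small portion of the data.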
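To find out where a file stops being well-formed before trying to split it, something like the following might help. It is only a sketch: the function name `first_xml_error` is made up for illustration, and it uses the standard-library `xml.parsers.expat` module rather than `ElementTree`, so no tree is built while the file is scanned.

```python
import xml.parsers.expat


def first_xml_error(path):
    """Return None if the file is well-formed XML, else (line, column, message).

    Hypothetical helper for illustration; not part of the original answer.
    """
    parser = xml.parsers.expat.ParserCreate()
    try:
        # ParseFile reads the file in chunks, so a very large file
        # does not have to fit in memory.
        with open(path, "rb") as f:
            parser.ParseFile(f)
    except xml.parsers.expat.ExpatError as err:
        # lineno is 1-based, offset is the 0-based column of the error.
        return err.lineno, err.offset, str(err)
    return None
```

Calling `first_xml_error("data.xml")` returns only the first problem the parser hits; a file with closing tags inserted at random will most likely need several rounds of fixing.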

test.xml for test:

<Seq-entry>
<User-field>
    <User-field_data>
        <User-field_data_str>100 of 262</User-field_data_str>
    </User-field_data>
    <User-field_data>
        <User-field_data_str>16 of 262</User-field_data_str>
    </User-field_data>
</User-field>
</Seq-entry> 

test.py:

import xml.etree.ElementTree as ET

# Stream the file and handle each element once its closing tag has been read
context = ET.iterparse('test.xml', events=('end', ))
for event, elem in context:
    if elem.tag == 'User-field_data':
        # Name the output file after the text of the child element
        title = elem.find('User-field_data_str').text
        filename = title + ".xml"
        with open(filename, 'wb') as f:
            f.write(ET.tostring(elem))

This way you can split the XML at whichever tag you need.
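For a file as large as the 12 GB one in the question, the loop above would probably run out of memory, because `iterparse` keeps every parsed element attached to the tree. A possible variant, assuming the input has already been made well-formed, clears elements once they have been written out. The function name `split_by_tag`, its parameters, and the file-name sanitising are assumptions for illustration and are not part of the original answer.

```python
import os
import re
import xml.etree.ElementTree as ET


def split_by_tag(source, tag, title_tag, out_dir="."):
    """Write each <tag> element to its own file, named after its <title_tag> child.

    Hypothetical helper; returns the list of paths written.
    """
    written = []
    context = ET.iterparse(source, events=("start", "end"))
    # The first event is the start of the root element; keep a handle on it.
    _event, root = next(context)
    for event, elem in context:
        if event != "end" or elem.tag != tag:
            continue
        title = elem.findtext(title_tag)
        if title is None:
            continue  # no child to take the file name from
        # Replace runs of characters that are awkward in file names with "_".
        safe = re.sub(r"[^\w.-]+", "_", title.strip())
        path = os.path.join(out_dir, safe + ".xml")
        with open(path, "wb") as f:
            f.write(ET.tostring(elem))
        written.append(path)
        # Free the element's content and detach finished top-level children.
        elem.clear()
        root.clear()
    return written
```

With the test.xml above, `split_by_tag("test.xml", "User-field_data", "User-field_data_str")` would write `100_of_262.xml` and `16_of_262.xml`. Two elements with the same title overwrite each other in this sketch, so a counter in the file name may be needed for real data.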