<?xml version="1.0"?>
<BioSampleSet>
<BioSample accession="SAMN01347139" id="1347139" submission_date="2012-09-21T22:44:26.843" last_update="2012-09-21T22:44:26.843" publication_date="2012-09-21T22:44:26.843" access="controlled-access">
<Ids>
<Id is_primary="1" db="BioSample">SAMN01347139</Id>
<Id db="dbGaP" is_hidden="1" db_label="Sample name">44-21834</Id>
</Ids>
<Description>
<Title>DNA sample from a human male participant in the dbGaP study "Framingham SHARe Thyroid and Hormone Data"</Title>
<Organism taxonomy_name="Homo sapiens" taxonomy_id="9606"/>
</Description>
<Owner>
<Name abbreviation="NCBI"/>
</Owner>
<Models>
<Model>Generic</Model>
</Models>
<Package display_name="Generic">Generic.1.0</Package>
<Attributes>
<Attribute display_name="gap accession" harmonized_name="gap_accession" attribute_name="gap_accession">phs000044</Attribute>
<Attribute display_name="submitter handle" harmonized_name="submitter_handle" attribute_name="submitter handle">Framingham_SHARe</Attribute>
<Attribute display_name="biospecimen repository" harmonized_name="biospecimen_repository" attribute_name="biospecimen repository">Framingham_SHARe</Attribute>
<Attribute display_name="study name" harmonized_name="study_name" attribute_name="study name">Framingham SHARe Thyroid and Hormone Data</Attribute>
<Attribute display_name="biospecimen repository sample id" harmonized_name="biospecimen_repository_sample_id" attribute_name="biospecimen repository sample id">21834</Attribute>
<Attribute display_name="submitted sample id" harmonized_name="submitted_sample_id" attribute_name="submitted sample id">21834</Attribute>
<Attribute display_name="submitted subject id" harmonized_name="submitted_subject_id" attribute_name="submitted subject id">21834</Attribute>
<Attribute display_name="gap sample id" harmonized_name="gap_sample_id" attribute_name="gap_sample_id">105542</Attribute>
<Attribute display_name="gap subject id" harmonized_name="gap_subject_id" attribute_name="gap_subject_id">28577</Attribute>
<Attribute display_name="sex" harmonized_name="sex" attribute_name="sex">male</Attribute>
<Attribute display_name="analyte type" harmonized_name="analyte_type" attribute_name="analyte type">DNA</Attribute>
<Attribute display_name="subject is affected" harmonized_name="subject_is_affected" attribute_name="subject is affected"/>
<Attribute display_name="gap consent code" harmonized_name="gap_consent_code" attribute_name="gap_consent_code">1</Attribute>
<Attribute display_name="gap consent short name" harmonized_name="gap_consent_short_name" attribute_name="gap_consent_short_name">GRU</Attribute>
</Attributes>
<Status when="2012-09-21T22:44:26.843" status="suppressed"/>
</BioSample>
</BioSampleSet>
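I want to parse the XML file given above programmatically. I tried using lxml, but I am having trouble extracting the keys and values inside the <Attributes> tag, because all of its child tags are named Attribute. Does anyone have any suggestions?
I tried splitting the text with a regular expression using "Attribute" as the delimiter, but because the whole file is a single line, the resulting list was just a list of the individual characters of that section.
I am using Python. The number of <Attribute> tags may change from time to time.
I am currently using the following code: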
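from lxml import objectify
import Bio.Entrez as Entrez

# sra_id holds the identifier of the BioSample record to fetch
meta_data = Entrez.efetch(db="biosample", id=sra_id, rettype="runinfo").read()
tree = objectify.fromstring(meta_data)
# This lookup fails: every child of <Attributes> is named <Attribute>,
# so there is no child element called submitter_handle
print(tree.BioSample.Attributes.submitter_handle)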
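One possible way around the identically named child tags is to stop addressing them by tag name and instead loop over every <Attribute> element, using one of its XML attributes (for example harmonized_name) as the key and the element text as the value. The sketch below is only an illustration under a few assumptions: it uses the standard library's xml.etree.ElementTree rather than lxml.objectify (lxml.etree offers the same iter/get/text calls), it works on a trimmed copy of the record above instead of a live Entrez response, and attributes_to_dict is a hypothetical helper name, not part of any library.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the record above; only a few <Attribute> elements are kept.
SAMPLE_XML = """<?xml version="1.0"?>
<BioSampleSet>
<BioSample accession="SAMN01347139" id="1347139">
<Attributes>
<Attribute display_name="gap accession" harmonized_name="gap_accession" attribute_name="gap_accession">phs000044</Attribute>
<Attribute display_name="submitter handle" harmonized_name="submitter_handle" attribute_name="submitter handle">Framingham_SHARe</Attribute>
<Attribute display_name="sex" harmonized_name="sex" attribute_name="sex">male</Attribute>
<Attribute display_name="subject is affected" harmonized_name="subject_is_affected" attribute_name="subject is affected"/>
</Attributes>
</BioSample>
</BioSampleSet>"""


def attributes_to_dict(xml_text, key="harmonized_name"):
    """Map each <Attribute>'s `key` XML attribute to its text (None if empty).

    Hypothetical helper for illustration; assumes a single <BioSample> record.
    """
    root = ET.fromstring(xml_text)
    result = {}
    # iter() visits every <Attribute> element, however many there are.
    for attr in root.iter("Attribute"):
        # Fall back to attribute_name if the chosen key is missing.
        name = attr.get(key) or attr.get("attribute_name")
        result[name] = attr.text
    return result


attrs = attributes_to_dict(SAMPLE_XML)
print(attrs["submitter_handle"])  # prints Framingham_SHARe
```

Because the loop does not depend on how many <Attribute> tags are present, it should keep working when that number changes. An empty element such as "subject is affected" ends up with the value None. If a response contains several <BioSample> records, the helper would need to be called per record, otherwise later records overwrite earlier keys.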