How do I select specific information from an XML file, in R or on another platform?

Asked: 2018-11-09 14:29:10

Tags: r xml

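Hello, I just downloaded an XML file from NCBI Nucleotide for the Aedes aegypti 5.8S region. As an example, I have pasted the information from the first record below.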

From it, I want to extract:
1. <INSDSeq_accession-version>CH477247.1</INSDSeq_accession-version>
2. <INSDSeq_update-date>23-MAR-2015</INSDSeq_update-date>
3. <INSDSeq_create-date>28-OCT-2005</INSDSeq_create-date>
4. <INSDReference_journal>Submitted (07-OCT-2005) Broad Institute of MIT and Harvard, 320 Charles Street, Cambridge, MA 02141, USA </INSDReference_journal>

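Also, as I said, this is only a short excerpt of what I actually downloaded (13 records): https://www.ncbi.nlm.nih.gov/nuccore/?term=aedes+aegypti+5.8. Is it possible to extract the required information for all of the records?
I am familiar with R, but which platform or package is better suited to this?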

<INSDSeq_locus>CH477247</INSDSeq_locus>
<INSDSeq_length>3065330</INSDSeq_length>
<INSDSeq_strandedness>double</INSDSeq_strandedness>
<INSDSeq_moltype>DNA</INSDSeq_moltype>
<INSDSeq_topology>linear</INSDSeq_topology>
<INSDSeq_division>CON</INSDSeq_division>
<INSDSeq_update-date>23-MAR-2015</INSDSeq_update-date>
<INSDSeq_create-date>28-OCT-2005</INSDSeq_create-date>
<INSDSeq_definition>Aedes aegypti strain Liverpool supercont1.62 genomic scaffold, whole genome shotgun sequence</INSDSeq_definition>
<INSDSeq_primary-accession>CH477247</INSDSeq_primary-accession>
<INSDSeq_accession-version>CH477247.1</INSDSeq_accession-version>
<INSDSeq_other-seqids>
  <INSDSeqid>gnl|WGS:AAGE|supercont1.62</INSDSeqid>
  <INSDSeqid>gb|CH477247.1|</INSDSeqid>
  <INSDSeqid>gi|78216626</INSDSeqid>
</INSDSeq_other-seqids>
<INSDSeq_project>PRJNA12434</INSDSeq_project>
<INSDSeq_keywords>
  <INSDKeyword>WGS</INSDKeyword>
</INSDSeq_keywords>
<INSDSeq_source>Aedes aegypti (yellow fever mosquito)</INSDSeq_source>
<INSDSeq_organism>Aedes aegypti</INSDSeq_organism>
<INSDSeq_taxonomy>Eukaryota; Metazoa; Ecdysozoa; Arthropoda; Hexapoda; Insecta; Pterygota; Neoptera; Holometabola; Diptera; Nematocera; Culicoidea; Culicidae; Culicinae; Aedini; Aedes; Stegomyia</INSDSeq_taxonomy>
<INSDSeq_references>
  <INSDReference>
    <INSDReference_reference>1</INSDReference_reference>
    <INSDReference_position>1..3065330</INSDReference_position>
    <INSDReference_authors>
      <INSDAuthor>Nene,V.</INSDAuthor>
      <INSDAuthor>Wortman,J.R.</INSDAuthor>
      <INSDAuthor>Lawson,D.</INSDAuthor>
      <INSDAuthor>Haas,B.</INSDAuthor>
      <INSDAuthor>Kodira,C.</INSDAuthor>
      <INSDAuthor>Tu,Z.J.</INSDAuthor>
      <INSDAuthor>Loftus,B.</INSDAuthor>
      <INSDAuthor>Xi,Z.</INSDAuthor>
      <INSDAuthor>Megy,K.</INSDAuthor>
      <INSDAuthor>Grabherr,M.</INSDAuthor>
      <INSDAuthor>Ren,Q.</INSDAuthor>
      <INSDAuthor>Zdobnov,E.M.</INSDAuthor>
      <INSDAuthor>Lobo,N.F.</INSDAuthor>
      <INSDAuthor>Campbell,K.S.</INSDAuthor>
      <INSDAuthor>Brown,S.E.</INSDAuthor>
      <INSDAuthor>Bonaldo,M.F.</INSDAuthor>
      <INSDAuthor>Zhu,J.</INSDAuthor>
      <INSDAuthor>Sinkins,S.P.</INSDAuthor>
      <INSDAuthor>Hogenkamp,D.G.</INSDAuthor>
      <INSDAuthor>Amedeo,P.</INSDAuthor>
      <INSDAuthor>Arensburger,P.</INSDAuthor>
      <INSDAuthor>Atkinson,P.W.</INSDAuthor>
      <INSDAuthor>Bidwell,S.</INSDAuthor>
      <INSDAuthor>Biedler,J.</INSDAuthor>
      <INSDAuthor>Birney,E.</INSDAuthor>
      <INSDAuthor>Bruggner,R.V.</INSDAuthor>
      <INSDAuthor>Costas,J.</INSDAuthor>
      <INSDAuthor>Coy,M.R.</INSDAuthor>
      <INSDAuthor>Crabtree,J.</INSDAuthor>
      <INSDAuthor>Crawford,M.</INSDAuthor>
      <INSDAuthor>Debruyn,B.</INSDAuthor>
      <INSDAuthor>Decaprio,D.</INSDAuthor>
      <INSDAuthor>Eiglmeier,K.</INSDAuthor>
      <INSDAuthor>Eisenstadt,E.</INSDAuthor>
      <INSDAuthor>El-Dorry,H.</INSDAuthor>
      <INSDAuthor>Gelbart,W.M.</INSDAuthor>
      <INSDAuthor>Gomes,S.L.</INSDAuthor>
      <INSDAuthor>Hammond,M.</INSDAuthor>
      <INSDAuthor>Hannick,L.I.</INSDAuthor>
      <INSDAuthor>Hogan,J.R.</INSDAuthor>
      <INSDAuthor>Holmes,M.H.</INSDAuthor>
      <INSDAuthor>Jaffe,D.</INSDAuthor>
      <INSDAuthor>Johnston,J.S.</INSDAuthor>
      <INSDAuthor>Kennedy,R.C.</INSDAuthor>
      <INSDAuthor>Koo,H.</INSDAuthor>
      <INSDAuthor>Kravitz,S.</INSDAuthor>
      <INSDAuthor>Kriventseva,E.V.</INSDAuthor>
      <INSDAuthor>Kulp,D.</INSDAuthor>
      <INSDAuthor>Labutti,K.</INSDAuthor>
      <INSDAuthor>Lee,E.</INSDAuthor>
      <INSDAuthor>Li,S.</INSDAuthor>
      <INSDAuthor>Lovin,D.D.</INSDAuthor>
      <INSDAuthor>Mao,C.</INSDAuthor>
      <INSDAuthor>Mauceli,E.</INSDAuthor>
      <INSDAuthor>Menck,C.F.</INSDAuthor>
      <INSDAuthor>Miller,J.R.</INSDAuthor>
      <INSDAuthor>Montgomery,P.</INSDAuthor>
      <INSDAuthor>Mori,A.</INSDAuthor>
      <INSDAuthor>Nascimento,A.L.</INSDAuthor>
      <INSDAuthor>Naveira,H.F.</INSDAuthor>
      <INSDAuthor>Nusbaum,C.</INSDAuthor>
      <INSDAuthor>O&apos;leary,S.</INSDAuthor>
      <INSDAuthor>Orvis,J.</INSDAuthor>
      <INSDAuthor>Pertea,M.</INSDAuthor>
      <INSDAuthor>Quesneville,H.</INSDAuthor>
      <INSDAuthor>Reidenbach,K.R.</INSDAuthor>
      <INSDAuthor>Rogers,Y.H.</INSDAuthor>
      <INSDAuthor>Roth,C.W.</INSDAuthor>
      <INSDAuthor>Schneider,J.R.</INSDAuthor>
      <INSDAuthor>Schatz,M.</INSDAuthor>
      <INSDAuthor>Shumway,M.</INSDAuthor>
      <INSDAuthor>Stanke,M.</INSDAuthor>
      <INSDAuthor>Stinson,E.O.</INSDAuthor>
      <INSDAuthor>Tubio,J.M.</INSDAuthor>
      <INSDAuthor>Vanzee,J.P.</INSDAuthor>
      <INSDAuthor>Verjovski-Almeida,S.</INSDAuthor>
      <INSDAuthor>Werner,D.</INSDAuthor>
      <INSDAuthor>White,O.</INSDAuthor>
      <INSDAuthor>Wyder,S.</INSDAuthor>
      <INSDAuthor>Zeng,Q.</INSDAuthor>
      <INSDAuthor>Zhao,Q.</INSDAuthor>
      <INSDAuthor>Zhao,Y.</INSDAuthor>
      <INSDAuthor>Hill,C.A.</INSDAuthor>
      <INSDAuthor>Raikhel,A.S.</INSDAuthor>
      <INSDAuthor>Soares,M.B.</INSDAuthor>
      <INSDAuthor>Knudson,D.L.</INSDAuthor>
      <INSDAuthor>Lee,N.H.</INSDAuthor>
      <INSDAuthor>Galagan,J.</INSDAuthor>
      <INSDAuthor>Salzberg,S.L.</INSDAuthor>
      <INSDAuthor>Paulsen,I.T.</INSDAuthor>
      <INSDAuthor>Dimopoulos,G.</INSDAuthor>
      <INSDAuthor>Collins,F.H.</INSDAuthor>
      <INSDAuthor>Birren,B.</INSDAuthor>
      <INSDAuthor>Fraser-Liggett,C.M.</INSDAuthor>
      <INSDAuthor>Severson,D.W.</INSDAuthor>
    </INSDReference_authors>
    <INSDReference_title>Genome sequence of Aedes aegypti, a major arbovirus vector</INSDReference_title>
    <INSDReference_journal>Science 316 (5832), 1718-1723 (2007)</INSDReference_journal>
    <INSDReference_xref>
      <INSDXref>
        <INSDXref_dbname>doi</INSDXref_dbname>
        <INSDXref_id>10.1126/science.1138878</INSDXref_id>
      </INSDXref>
    </INSDReference_xref>
    <INSDReference_pubmed>17510324</INSDReference_pubmed>
  </INSDReference>
  <INSDReference>
    <INSDReference_reference>2</INSDReference_reference>
    <INSDReference_position>1..3065330</INSDReference_position>
    <INSDReference_authors>
      <INSDAuthor>Galagan,J.</INSDAuthor>
      <INSDAuthor>Devon,K.</INSDAuthor>
      <INSDAuthor>Henn,M.R.</INSDAuthor>
      <INSDAuthor>Severson,D.W.</INSDAuthor>
      <INSDAuthor>Collins,F.</INSDAuthor>
      <INSDAuthor>Jaffe,D.</INSDAuthor>
      <INSDAuthor>Rounsley,S.</INSDAuthor>
      <INSDAuthor>DeCaprio,D.</INSDAuthor>
      <INSDAuthor>Kodira,C.</INSDAuthor>
      <INSDAuthor>Lander,E.</INSDAuthor>
      <INSDAuthor>Crawford,M.</INSDAuthor>
      <INSDAuthor>Butler,J.</INSDAuthor>
      <INSDAuthor>Alvarez,P.</INSDAuthor>
      <INSDAuthor>Gnerre,S.</INSDAuthor>
      <INSDAuthor>Grabherr,M.</INSDAuthor>
      <INSDAuthor>Kleber,M.</INSDAuthor>
      <INSDAuthor>Mauceli,E.</INSDAuthor>
      <INSDAuthor>Brockman,W.</INSDAuthor>
      <INSDAuthor>Young,S.</INSDAuthor>
      <INSDAuthor>LaButti,K.</INSDAuthor>
      <INSDAuthor>Pushparaj,V.</INSDAuthor>
      <INSDAuthor>Koehrsen,M.</INSDAuthor>
      <INSDAuthor>Engels,R.</INSDAuthor>
      <INSDAuthor>Montgomery,P.</INSDAuthor>
      <INSDAuthor>Pearson,M.</INSDAuthor>
      <INSDAuthor>Howarth,C.</INSDAuthor>
      <INSDAuthor>Zeng,Q.</INSDAuthor>
      <INSDAuthor>Yandava,C.</INSDAuthor>
      <INSDAuthor>Oleary,S.</INSDAuthor>
      <INSDAuthor>Alvarado,L.</INSDAuthor>
      <INSDAuthor>Nusbaum,C.</INSDAuthor>
      <INSDAuthor>Birren,B.</INSDAuthor>
    </INSDReference_authors>
    <INSDReference_consortium>The Broad Institute Genome Sequencing Platform</INSDReference_consortium>
    <INSDReference_title>Direct Submission</INSDReference_title>
    <INSDReference_journal>Submitted (07-OCT-2005) Broad Institute of MIT and Harvard, 320 Charles Street, Cambridge, MA 02141, USA</INSDReference_journal>
  </INSDReference>
  <INSDReference>
    <INSDReference_reference>3</INSDReference_reference>
    <INSDReference_position>1..3065330</INSDReference_position>
    <INSDReference_authors>
      <INSDAuthor>Loftus,B.J.</INSDAuthor>
      <INSDAuthor>Nene,V.M.</INSDAuthor>
      <INSDAuthor>Hannick,L.I.</INSDAuthor>
      <INSDAuthor>Bidwell,S.</INSDAuthor>
      <INSDAuthor>Haas,B.</INSDAuthor>
      <INSDAuthor>Amedeo,P.</INSDAuthor>
      <INSDAuthor>Orvis,J.</INSDAuthor>
      <INSDAuthor>Wortman,J.R.</INSDAuthor>
      <INSDAuthor>White,O.R.</INSDAuthor>
      <INSDAuthor>Salzberg,S.</INSDAuthor>
      <INSDAuthor>Shumway,M.</INSDAuthor>
      <INSDAuthor>Koo,H.</INSDAuthor>
      <INSDAuthor>Zhao,Y.</INSDAuthor>
      <INSDAuthor>Holmes,M.</INSDAuthor>
      <INSDAuthor>Miller,J.</INSDAuthor>
      <INSDAuthor>Schatz,M.</INSDAuthor>
      <INSDAuthor>Pop,M.</INSDAuthor>
      <INSDAuthor>Pai,G.</INSDAuthor>
      <INSDAuthor>Utterback,T.</INSDAuthor>
      <INSDAuthor>Rogers,Y.-H.</INSDAuthor>
      <INSDAuthor>Kravitz,S.</INSDAuthor>
      <INSDAuthor>Fraser,C.M.</INSDAuthor>
    </INSDReference_authors>
    <INSDReference_title>Direct Submission</INSDReference_title>
    <INSDReference_journal>Submitted (07-OCT-2005) The Institute for Genomic Research, 9712 Medical Center Drive, Rockville, MD 20850, USA</INSDReference_journal>
  </INSDReference>
  <INSDReference>
    <INSDReference_reference>4</INSDReference_reference>
    <INSDReference_position>1..3065330</INSDReference_position>
    <INSDReference_consortium>VectorBase</INSDReference_consortium>
    <INSDReference_title>Direct Submission</INSDReference_title>
    <INSDReference_journal>Submitted (05-SEP-2012) VectorBase / Ensembl, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK</INSDReference_journal>
    <INSDReference_remark>Annotation update by submitter</INSDReference_remark>
  </INSDReference>
</INSDSeq_references>
<INSDSeq_comment>The sequence for this assembly was produced jointly by The Broad Institute of Harvard/MIT and The Institute for Genomic Research. The assembly represents 7.6X sequence coverage of the genome and the total length of the contigs is 1.31 Gb. Additional information about the Aedes aegypti sequencing project and assembly can be found at http://www.broad.mit.edu/annotation/disease_vector/aedes_aegypti/ and http://www.tigr.org/msc/aedes/aedes.shtml. Long-term curation of the sequence and subsequent annotation updates will be the responsibility of VectorBase at http://www.vectorbase.org.~Annotation was updated by VectorBase in Sept 2012.</INSDSeq_comment>
<INSDSeq_feature-table>
  <INSDFeature>
    <INSDFeature_key>source</INSDFeature_key>
    <INSDFeature_location>1..3065330</INSDFeature_location>
    <INSDFeature_intervals>
      <INSDInterval>
        <INSDInterval_from>1</INSDInterval_from>
        <INSDInterval_to>3065330</INSDInterval_to>
        <INSDInterval_accession>CH477247.1</INSDInterval_accession>
      </INSDInterval>
    </INSDFeature_intervals>
    <INSDFeature_quals>
      <INSDQualifier>
        <INSDQualifier_name>organism</INSDQualifier_name>
        <INSDQualifier_value>Aedes aegypti</INSDQualifier_value>
      </INSDQualifier>
      <INSDQualifier>
        <INSDQualifier_name>mol_type</INSDQualifier_name>
        <INSDQualifier_value>genomic DNA</INSDQualifier_value>
      </INSDQualifier>
      <INSDQualifier>
        <INSDQualifier_name>strain</INSDQualifier_name>
        <INSDQualifier_value>Liverpool</INSDQualifier_value>
      </INSDQualifier>
      <INSDQualifier>
        <INSDQualifier_name>db_xref</INSDQualifier_name>
        <INSDQualifier_value>taxon:7159</INSDQualifier_value>
      </INSDQualifier>
      <INSDQualifier>
        <INSDQualifier_name>chromosome</INSDQualifier_name>
        <INSDQualifier_value>2</INSDQualifier_value>
      </INSDQualifier>
    </INSDFeature_quals>
  </INSDFeature>
</INSDSeq_feature-table>
<INSDSeq_contig>join(AAGE02003964.1:1..7226,gap(unk100),AAGE02003965.1:1..6376,gap(unk100),AAGE02003966.1:1..16236,gap(4301),AAGE02003967.1:1..174188,gap(unk100),AAGE02003968.1:1..24199,gap(1396),AAGE02003969.1:1..104064,gap(29770),AAGE02003970.1:1..12303,gap(56956),AAGE02003971.1:1..2368,gap(12542),AAGE02003972.1:1..29888,gap(1379),AAGE02003973.1:1..98175,gap(unk100),AAGE02003974.1:1..13180,gap(unk100),AAGE02003975.1:1..2872,gap(unk100),AAGE02003976.1:1..18626,gap(unk100),AAGE02003977.1:1..52378,gap(151),AAGE02003978.1:1..153108,gap(901),AAGE02003979.1:1..3583,gap(unk100),AAGE02003980.1:1..32852,gap(unk100),AAGE02003981.1:1..68239,gap(unk100),AAGE02003982.1:1..61056,gap(unk100),AAGE02003983.1:1..21852,gap(unk100),AAGE02003984.1:1..49659,gap(unk100),AAGE02003985.1:1..33070,gap(315),AAGE02003986.1:1..411266,gap(unk100),AAGE02003987.1:1..2985,gap(unk100),AAGE02003988.1:1..38365,gap(159),AAGE02003989.1:1..110697,gap(890),AAGE02003990.1:1..22405,gap(2299),AAGE02003991.1:1..7510,gap(187),AAGE02003992.1:1..447937,gap(263),AAGE02003993.1:1..92770,gap(1409),AAGE02003994.1:1..2258,gap(132),AAGE02003995.1:1..5605,gap(unk100),AAGE02003996.1:1..3451,gap(2717),AAGE02003997.1:1..20215,gap(unk100),AAGE02003998.1:1..35683,gap(514),AAGE02003999.1:1..307288,gap(unk100),AAGE02004000.1:1..71359,gap(433),AAGE02004001.1:1..10550,gap(unk100),AAGE02004002.1:1..289125,gap(unk100),AAGE02004003.1:1..45622,gap(unk100),AAGE02004004.1:1..35927)</INSDSeq_contig>
<INSDSeq_xrefs>
  <INSDXref>
    <INSDXref_dbname>BioProject</INSDXref_dbname>
    <INSDXref_id>PRJNA12434</INSDXref_id>
  </INSDXref>
  <INSDXref>
    <INSDXref_dbname>BioSample</INSDXref_dbname>
    <INSDXref_id>SAMN02953616</INSDXref_id>
  </INSDXref>
</INSDSeq_xrefs>


1 answer:

Answer 0 (score: 0)

Use XPath or CSS selectors.

Which one, and the exact syntax, depends on the language and library you are using.
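As a rough illustration of the XPath-style approach, here is a minimal sketch in Python using only the standard library's `xml.etree.ElementTree`. Python is just one possible choice; the same element paths should carry over to an XPath-capable R package. Several things here are assumptions rather than facts from the question: the `<INSDSet>` and `<INSDSeq>` wrapper elements (the pasted excerpt shows only the child elements), the trimmed `SAMPLE` string standing in for the real download, the `extract_records` helper name, and the rule that the wanted journal entries are the ones beginning with "Submitted".

```python
import xml.etree.ElementTree as ET

# Trimmed stand-in for the NCBI download. The <INSDSet>/<INSDSeq> wrappers are
# assumed; the excerpt in the question shows only the elements inside a record.
SAMPLE = """<INSDSet>
  <INSDSeq>
    <INSDSeq_update-date>23-MAR-2015</INSDSeq_update-date>
    <INSDSeq_create-date>28-OCT-2005</INSDSeq_create-date>
    <INSDSeq_accession-version>CH477247.1</INSDSeq_accession-version>
    <INSDSeq_references>
      <INSDReference>
        <INSDReference_journal>Science 316 (5832), 1718-1723 (2007)</INSDReference_journal>
      </INSDReference>
      <INSDReference>
        <INSDReference_journal>Submitted (07-OCT-2005) Broad Institute of MIT and Harvard, 320 Charles Street, Cambridge, MA 02141, USA</INSDReference_journal>
      </INSDReference>
    </INSDSeq_references>
  </INSDSeq>
</INSDSet>"""


def extract_records(root):
    """Return one dict per <INSDSeq> record with the requested fields."""
    records = []
    for seq in root.iter("INSDSeq"):
        # Collect every journal string found anywhere inside this record.
        journals = [
            j.text.strip()
            for j in seq.iter("INSDReference_journal")
            if j.text
        ]
        # Assumption: the wanted entries are the direct submissions,
        # whose journal text starts with "Submitted".
        submitted = [j for j in journals if j.startswith("Submitted")]
        records.append({
            "accession_version": seq.findtext("INSDSeq_accession-version"),
            "update_date": seq.findtext("INSDSeq_update-date"),
            "create_date": seq.findtext("INSDSeq_create-date"),
            "submitted": submitted,
        })
    return records


records = extract_records(ET.fromstring(SAMPLE))
for rec in records:
    print(rec)
```

To run this against the downloaded file instead of the inline sample, `ET.parse(path).getroot()` can replace `ET.fromstring(SAMPLE)`, where `path` is wherever the file was saved. Because the loop visits every `<INSDSeq>` element, all 13 records would be handled in one pass, provided the file really uses that wrapper structure. A record with several direct submissions yields several entries in `submitted`, so some further filtering may be needed if only one is wanted.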