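How do I get data out of a nested XML structure?

Time: 2020-10-22 06:43:58

Tags: r xml web-scraping nested

I am trying to work with an API that returns data as nested XML, and I would like to save it as a data frame. My problem is that I do not know how to get values out of the nested XML. Here is an example: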

# Sample data
library(xml2)
url <- "https://clinicaltrials.gov/api/query/full_studies?expr=neuro&min_rnk=1&max_rnk=20&fmt=xml"
download.file(url, destfile = "xml_data.xml")
fil <- "xml_data.xml"
dat <- xml2::read_xml(fil)

This gives a nested XML file, but I do not know how to work with this structure.

<FullStudiesResponse>
  ....
  <FullStudyList>
    <FullStudy Rank="1">
      <Struct Name="Study">
        <Struct Name="ProtocolSection">
          <Struct Name="IdentificationModule">
            <Field Name="NCTId">NCT01843582</Field>

I can access FullStudyList with the following command:

xml_find_all(x = dat, xpath = "//FullStudyList/FullStudy")

But if, for example, I want to get all the NCTId or Rank values, how do I reference them? So far I have tried:

xml_find_all(x = dat, xpath = "//FullStudyList/FullStudy/NCTId")
xml_find_all(x = dat, xpath = "//FullStudyList/FullStudy/@NCTId")
xml_find_all(x = dat, xpath = "//FullStudyList/FullStudy//NCTId")

These obviously do not work. Or is there a better way to work with nested XML to get the data into a data frame?

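1 Answer:

Answer 0 (score: 2)

The short answer is: don't use XML. The site's documentation, linked below, says you can specify the fmt you want. It does not have to be XML, and JSON is easier to handle in R.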

documentation
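If the XML format has to stay, the attempts in the question fail because NCTId is not an element or attribute name: it is the value of the Name attribute on a Field element, while Rank is an attribute of FullStudy. Here is a minimal sketch, assuming the structure shown in the question. The inline document is a trimmed stand-in so the example runs offline, and the second study's ID (NCT00000000) is a made-up placeholder.

```r
library(xml2)

# Trimmed stand-in for the API response; the second NCTId is a placeholder.
xml_txt <- '<FullStudiesResponse>
  <FullStudyList>
    <FullStudy Rank="1">
      <Struct Name="Study">
        <Struct Name="ProtocolSection">
          <Struct Name="IdentificationModule">
            <Field Name="NCTId">NCT01843582</Field>
          </Struct>
        </Struct>
      </Struct>
    </FullStudy>
    <FullStudy Rank="2">
      <Struct Name="Study">
        <Struct Name="ProtocolSection">
          <Struct Name="IdentificationModule">
            <Field Name="NCTId">NCT00000000</Field>
          </Struct>
        </Struct>
      </Struct>
    </FullStudy>
  </FullStudyList>
</FullStudiesResponse>'
doc <- read_xml(xml_txt)

# One node per study
studies <- xml_find_all(doc, "//FullStudyList/FullStudy")

ids <- data.frame(
  # Rank is an attribute of <FullStudy>
  Rank = as.integer(xml_attr(studies, "Rank")),
  # NCTId is the text of the <Field> whose Name attribute equals "NCTId";
  # xml_find_first() returns one result per study (NA if it is missing)
  NCTId = xml_text(xml_find_first(studies, ".//Field[@Name='NCTId']")),
  stringsAsFactors = FALSE
)
ids
```

The same predicate pattern, Field[@Name='...'], should work for any other field name that appears in the response, but every field has to be pulled out one XPath at a time, which is why JSON is less work here.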

Try the JSON format instead:

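library(httr)
library(jsonlite)
library(tibble)

# Request JSON (fmt=json) and hand the raw response text to fromJSON(),
# so it is parsed once and simplified into data frames
resp <- GET("https://clinicaltrials.gov/api/query/full_studies?expr=neuro&min_rnk=1&max_rnk=20&fmt=json")
res <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))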

The result is a nested list, but I assume you are interested in the data stored in FullStudies:

df <- as_tibble(res$FullStudiesResponse$FullStudies)

This gives us:

# A tibble: 20 x 2
    Rank Study$ProtocolS~ $$$OrgStudyIdIn~ $$$$OrgStudyIdT~ $$$$OrgStudyIdL~ $$$Organization~ $$$$OrgClass $$$BriefTitle $$$OfficialTitle $$$Acronym $$StatusModule$~
   <int> <chr>            <chr>            <chr>            <chr>            <chr>            <chr>        <chr>         <chr>            <chr>      <chr>           
 1     1 NCT02642055      NEURO+001        NA               NA               Neuro+           INDUSTRY     Efficacy of ~ Efficacy of NEU~ NA         May 2016        
 2     2 NCT01801813      RC12_0416        NA               NA               Nantes Universi~ OTHER        Risk Factors~ Observational S~ Craniosco~ March 2016      
 3     3 NCT03813290      DSRB A/2018/006~ NA               NA               National Health~ OTHER_GOV    A Neuro-Tech~ A Neuro-Technol~ NA         February 2020   
 4     4 NCT03773926      2018-A00604-51   NA               NA               Zeta Technologi~ INDUSTRY     Neuro-feedba~ Neuro-feedback ~ TNTA       December 2018   
 5     5 NCT04189172      AAG-O-H-1630     NA               NA               Aesculap AG      INDUSTRY     MiDura-Study~ Multicenter, In~ MiDura     May 2020        
 6     6 NCT03756337      PIC-20           NA               NA               Oticon Medical   INDUSTRY     Neuro 1 vs. ~ Comparison of A~ NA         November 2018   
 7     7 NCT03484143      P17.03           NA               NA               Vielight Inc.    INDUSTRY     Neuro RX Gam~ Vielight Neuro ~ NA         June 2020       
 8     8 NCT02138110      InVivo-100-101   NA               NA               InVivo Therapeu~ INDUSTRY     The INSPIRE ~ The INSPIRE Stu~ NA         December 2019   
 9     9 NCT03935724      A2017SCI03       NA               NA               Neuroplast       INDUSTRY     Clinical Stu~ A Multi-center,~ SCI2       September 2020  
10    10 NCT03798002      RiphahI Maryam ~ NA               NA               Riphah Internat~ OTHER        Neuro-muscul~ Effects of Neur~ NA         August 2019     
11    11 NCT03655262      R61MH113772      U.S. NIH Grant/~ https://project~ University of C~ OTHER        Treating Pho~ Treating Phobia~ NA         April 2019      
12    12 NCT04418609      Neuro-COVID-19   NA               NA               University of Z~ OTHER        Neuro-COVID-~ Neuro-COVID-19:~ Neuro-COV~ June 2020       
13    13 NCT01174329      1234             NA               NA               Universidad Aut~ OTHER        Treatment of~ Difference in S~ SALELECTR~ July 2010       
14    14 NCT04205019      A2019SCI04       NA               NA               Neuroplast       INDUSTRY     Safety Stem ~ A 3 Months Open~ SSCiSCI    September 2020  
15    15 NCT02941627      PIC_07           NA               NA               Oticon Medical   INDUSTRY     The Neuro Zt~ The Neuro Zti C~ NA         February 2017   
16    16 NCT03328195      P17.02           NA               NA               Vielight Inc.    INDUSTRY     Vielight Neu~ A Pilot Study E~ NA         September 2020  
17    17 NCT02401841      Policlinico 12   NA               NA               Policlinico Hos~ OTHER        Resolution o~ Resolution of N~ NA         October 2015    
18    18 NCT03882567      03/2015          NA               NA               Universidad Rey~ OTHER        Effectivenes~ Effectiveness o~ SCENAR     October 2019    
19    19 NCT04583163      2019-0945        NA               NA               Hackensack Meri~ OTHER        Variability ~ Inter- and Intr~ NA         October 2020    
20    20 NCT01845155      CMTR-TC-02       NA               NA               German Center f~ OTHER        Neuro-Music-~ Neuro-Music-The~ NA         February 2014   
# ... with 103 more variables: $$$OverallStatus <chr>, $$$ExpandedAccessInfo$HasExpandedAccess <chr>, $$$StartDateStruct$StartDate <chr>, $$$$StartDateType <chr>,
#   $$$PrimaryCompletionDateStruct$PrimaryCompletionDate <chr>, $$$$PrimaryCompletionDateType <chr>, $$$CompletionDateStruct$CompletionDate <chr>,
#   $$$$CompletionDateType <chr>, $$$StudyFirstSubmitDate <chr>, $$$StudyFirstSubmitQCDate <chr>, $$$StudyFirstPostDateStruct$StudyFirstPostDate <chr>,
#   $$$$StudyFirstPostDateType <chr>, $$$LastUpdateSubmitDate <chr>, $$$LastUpdatePostDateStruct$LastUpdatePostDate <chr>, $$$$LastUpdatePostDateType <chr>,
#   $$$ResultsFirstSubmitDate <chr>, $$$ResultsFirstSubmitQCDate <chr>, $$$ResultsFirstPostDateStruct$ResultsFirstPostDate <chr>, $$$$ResultsFirstPostDateType <chr>,
#   $$$LastKnownStatus <chr>, $$SponsorCollaboratorsModule$ResponsibleParty$ResponsiblePartyType <chr>, $$$$ResponsiblePartyInvestigatorFullName <chr>,
#   $$$$ResponsiblePartyInvestigatorTitle <chr>, $$$$ResponsiblePartyInvestigatorAffiliation <chr>, $$$$ResponsiblePartyOldNameTitle <chr>,
#   $$$$ResponsiblePartyOldOrganization <chr>, $$$LeadSponsor$LeadSponsorName <chr>, $$$$LeadSponsorClass <chr>, $$$CollaboratorList$Collaborator <list>,
#   $$OversightModule$OversightHasDMC <chr>, $$$IsFDARegulatedDrug <chr>, $$$IsFDARegulatedDevice <chr>, $$$IsUnapprovedDevice <chr>, $$$IsUSExport <chr>,
#   $$DescriptionModule$BriefSummary <chr>, $$$DetailedDescription <chr>, $$ConditionsModule$ConditionList$Condition <list>, $$$KeywordList$Keyword <list>,
#   $$DesignModule$StudyType <chr>, $$$PhaseList$Phase <list>, $$$DesignInfo$DesignAllocation <chr>, $$$$DesignInterventionModel <chr>,
#   $$$$DesignPrimaryPurpose <chr>, $$$$DesignMaskingInfo$DesignMasking <chr>, $$$$$DesignWhoMaskedList$DesignWhoMasked <list>, $$$$$DesignMaskingDescription <chr>,
#   $$$$DesignObservationalModelList$DesignObservationalModel <list>, $$$$DesignTimePerspectiveList$DesignTimePerspective <list>,
#   $$$$DesignInterventionModelDescription <chr>, $$$EnrollmentInfo$EnrollmentCount <chr>, $$$$EnrollmentType <chr>, $$$PatientRegistry <chr>,
#   $$$TargetDuration <chr>, $$ArmsInterventionsModule$ArmGroupList$ArmGroup <list>, $$$InterventionList$Intervention <list>,
#   $$OutcomesModule$PrimaryOutcomeList$PrimaryOutcome <list>, $$$SecondaryOutcomeList$SecondaryOutcome <list>, $$$OtherOutcomeList$OtherOutcome <list>,
#   $$EligibilityModule$EligibilityCriteria <chr>, $$$HealthyVolunteers <chr>, $$$Gender <chr>, $$$MinimumAge <chr>, $$$MaximumAge <chr>, $$$StdAgeList$StdAge <list>,
#   $$$StudyPopulation <chr>, $$$SamplingMethod <chr>, $$ContactsLocationsModule$OverallOfficialList$OverallOfficial <list>, $$$LocationList$Location <list>,
#   $$$CentralContactList$CentralContact <list>, $$IPDSharingStatementModule$IPDSharing <chr>, $$ReferencesModule$ReferenceList$Reference <list>,
#   $$$SeeAlsoLinkList$SeeAlsoLink <list>, $DerivedSection$MiscInfoModule$VersionHolder <chr>, $$$RemovedCountryList$RemovedCountry <list>,
#   $$ConditionBrowseModule$ConditionMeshList$ConditionMesh <list>, $$$ConditionAncestorList$ConditionAncestor <list>,
#   $$$ConditionBrowseLeafList$ConditionBrowseLeaf <list>, $$$ConditionBrowseBranchList$ConditionBrowseBranch <list>,
#   $$InterventionBrowseModule$InterventionBrowseLeafList$InterventionBrowseLeaf <list>, $$$InterventionBrowseBranchList$InterventionBrowseBranch <list>,
#   $ResultsSection$ParticipantFlowModule$FlowGroupList$FlowGroup <list>, $$$FlowPeriodList$FlowPeriod <list>, $$$FlowPreAssignmentDetails <chr>,
#   $$$FlowRecruitmentDetails <chr>, $$BaselineCharacteristicsModule$BaselinePopulationDescription <chr>, $$$BaselineGroupList$BaselineGroup <list>,
#   $$$BaselineDenomList$BaselineDenom <list>, $$$BaselineMeasureList$BaselineMeasure <list>, $$OutcomeMeasuresModule$OutcomeMeasureList$OutcomeMeasure <list>,
#   $$AdverseEventsModule$EventsFrequencyThreshold <chr>, $$$EventsTimeFrame <chr>, $$$EventGroupList$EventGroup <list>, $$$SeriousEventList$SeriousEvent <list>,
#   $$$OtherEventList$OtherEvent <list>, $$MoreInfoModule$CertainAgreement$AgreementPISponsorEmployee <chr>, $$$$AgreementRestrictiveAgreement <chr>,
#   $$$PointOfContact$PointOfContactTitle <chr>, $$$$PointOfContactOrganization <chr>, $$$$PointOfContactEMail <chr>, $$$$PointOfContactPhone <chr>, ...
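
The runs of $ in the column headers above are how tibble prints nested data frame columns: Study is itself a data frame, which contains ProtocolSection, and so on. Here is a minimal sketch of two ways to reach a single field, assuming the nesting Study > ProtocolSection > IdentificationModule > NCTId that the headers suggest. The inline JSON is a trimmed stand-in for the real response so the example runs offline, and using jsonlite::flatten() is my own suggestion for getting ordinary columns.

```r
library(jsonlite)

# Trimmed stand-in for the API response (only Rank and NCTId are kept)
json_txt <- '{
  "FullStudiesResponse": {
    "FullStudies": [
      {"Rank": 1, "Study": {"ProtocolSection": {"IdentificationModule": {"NCTId": "NCT02642055"}}}},
      {"Rank": 2, "Study": {"ProtocolSection": {"IdentificationModule": {"NCTId": "NCT01801813"}}}}
    ]
  }
}'
res <- fromJSON(json_txt)
studies <- res$FullStudiesResponse$FullStudies

# Option 1: walk the nested data frames with `$`
nct_ids <- studies$Study$ProtocolSection$IdentificationModule$NCTId

# Option 2: flatten the nested data frames into ordinary columns;
# the nested names are joined with "."
flat <- jsonlite::flatten(studies)
names(flat)
```

After flattening, columns that hold repeated elements (the ones shown as <list> above) remain list columns and still need to be unpacked separately.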