Getting an empty data frame when converting XML to a data frame

Time: 2017-04-11 20:53:06

Tags: r xml

I have XML structured as follows (this is only a subset):

<rss version="2.0">
<channel>
<title>Marketwired - Medical and Healthcare</title>
<link>http://www.marketwired.com</link>
<description>Marketwired - Medical and Healthcare</description>
<language>en</language>
<copyright>Copyright: (C) Marketwired</copyright>
<lastBuildDate>Tue, 11 Apr 2017 15:23:48 EDT</lastBuildDate>
<ttl>1</ttl>
<image>
<title>Marketwired.com</title>
<url>http://www.marketwired.com/rss/marketwire_logo.jpg</url>
<link>http://www.marketwired.com</link>
</image>
<item>
<title>
American Academy of Dermatology: Tips to Prevent and Treat Bug Bites
</title>
<link>
http://www.marketwired.com/mw/release.do?id=2209171&sourceType=3
</link>
<description>
<div class="mw_release"> <p>SCHAUMBURG, IL--(Marketwired - Apr 11, 2017) - Although warm, spring weather means more time outdoors, it also means more bugs -- like bees, ticks and mosquitoes. The best way to deal with pesky bites and stings, say dermatologists from the American Academy of Dermatology, is to prevent them in the first place. This can also help you avoid an insect-related disease, which can put a damper on anyone's spring.
</description>
<pubDate>Tue, 11 Apr 2017 14:00:00 EDT</pubDate>
</item>
</channel>
</rss>

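When I try to convert the structure above into a data frame with the following lines of code, I get a data frame with 0 rows and 0 columns: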

library(XML)

link <- "http://www.marketwire.com/rss/mwMedicalandHealthcare.xml"
xmlfeed<-xmlParse(link,asText=FALSE)
xmldata<-xmlToDataFrame(nodes=getNodeSet(xmlfeed,"rss/channel/item"),stringsAsFactors = FALSE)

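So what is the problem? I am trying to read the title and description under each item tag. You can view the entire XML file at the link mentioned above.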

1 Answer:

Answer 0: (score: 1)

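Essentially, your XPath expression is slightly off: it needs a leading forward slash so that it starts from the root, i.e. /rss/channel/item. With this particular XML, however, xmlToDataFrame still cannot build a data frame, because the repeated <category> elements would produce duplicate column names.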
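To see the effect of the leading slash in isolation, here is a minimal sketch on a made-up two-item feed (the `rss_text` string and its contents are hypothetical, not taken from the Marketwired feed). It counts the nodes matched by the relative and the absolute path, then calls xmlToDataFrame on a feed that has no repeated child elements. Going by the explanation above, the relative path should match nothing when evaluated against the document.

```r
library(XML)

# Hypothetical two-item feed used only for illustration.
rss_text <- '<rss version="2.0"><channel>
  <title>Demo feed</title>
  <item><title>First</title><link>http://example.com/1</link></item>
  <item><title>Second</title><link>http://example.com/2</link></item>
</channel></rss>'

doc <- xmlParse(rss_text, asText = TRUE)

# Relative path, as written in the question.
rel_nodes <- getNodeSet(doc, "rss/channel/item")
# Absolute path anchored at the document root.
abs_nodes <- getNodeSet(doc, "/rss/channel/item")

cat("relative path matches:", length(rel_nodes), "\n")
cat("absolute path matches:", length(abs_nodes), "\n")

# Each item here has uniquely named children, so one row per item can be built.
xmldata <- xmlToDataFrame(nodes = abs_nodes, stringsAsFactors = FALSE)
print(xmldata)
```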

Consider using xpathSApply to select the elements that appear consistently in every item, such as title and link:

xmldata <- data.frame(
  title = xpathSApply(xmlfeed, "/rss/channel/item/title", xmlValue),
  link = xpathSApply(xmlfeed, "/rss/channel/item/link", xmlValue), 
  stringsAsFactors = FALSE
)

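Now, if you do need every category, consider binding them to category1, category2, category3, ... fields while looping over the number of item nodes. Specifically, use XPath's concat function to return a zero-length string when no such element exists: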

no_items <- length(getNodeSet(xmlfeed,"/rss/channel/item"))

dfs <- lapply(seq(no_items), function(i){
  data.frame(
    title = xpathSApply(xmlfeed, paste0("/rss/channel/item[",i,"]/title"), xmlValue),
    link = xpathSApply(xmlfeed, paste0("/rss/channel/item[",i,"]/link"), xmlValue), 
    category1 = xpathSApply(xmlfeed, paste0("concat(/rss/channel/item[",i,"]/category[1], '')"), xmlValue), 
    category2= xpathSApply(xmlfeed, paste0("concat(/rss/channel/item[",i,"]/category[2], '')"), xmlValue), 
    category3= xpathSApply(xmlfeed, paste0("concat(/rss/channel/item[",i,"]/category[3], '')"), xmlValue), 
    category4= xpathSApply(xmlfeed, paste0("concat(/rss/channel/item[",i,"]/category[4], '')"), xmlValue), 
    category5= xpathSApply(xmlfeed, paste0("concat(/rss/channel/item[",i,"]/category[5], '')"), xmlValue), 
    stringsAsFactors = FALSE
  )      
})

xmldata <- do.call(rbind, dfs)
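If the maximum number of categories is not known in advance, the same idea can be sketched without hard-coding five columns. This is an alternative to the concat approach above, not part of the original answer: it collects the categories for each item node and pads the shorter vectors with empty strings. The `rss_text` feed and the helper names (`cats`, `padded`, `cat_matrix`) are hypothetical and exist only for illustration.

```r
library(XML)

# Hypothetical miniature feed; only the element names come from the question.
rss_text <- '<rss version="2.0"><channel>
  <item><title>First</title><link>http://example.com/1</link>
    <category>A</category><category>B</category></item>
  <item><title>Second</title><link>http://example.com/2</link></item>
</channel></rss>'

doc <- xmlParse(rss_text, asText = TRUE)
items <- getNodeSet(doc, "/rss/channel/item")

# Collect the categories of each item as a character vector (possibly empty).
cats <- lapply(items, function(node) {
  as.character(unlist(xpathSApply(node, "./category", xmlValue)))
})

# Pad every vector with "" up to the largest category count seen.
max_cats <- max(lengths(cats), 0)
padded <- lapply(cats, function(x) c(x, rep("", max_cats - length(x))))
cat_matrix <- matrix(unlist(padded), nrow = length(items),
                     ncol = max_cats, byrow = TRUE)
colnames(cat_matrix) <- paste0("category", seq_len(max_cats))

# Assumes every item has exactly one title and one link, as in the feed above.
xmldata <- data.frame(
  title = xpathSApply(doc, "/rss/channel/item/title", xmlValue),
  link = xpathSApply(doc, "/rss/channel/item/link", xmlValue),
  cat_matrix,
  stringsAsFactors = FALSE
)
print(xmldata)
```

The trade-off is that the number of category columns now depends on the data, so two downloads of the same feed may produce data frames of different widths.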

Output

head(xmldata)

    # title
# 1 Morehouse Instrument Finds Perfect Balance Between Field Convenience And Laboratory Grade Precision with New Portable Force Calibrator
# 2          Cura-Can Health Corp. Secures Right to Acquire Assets of The Clinic Network Inc. and Acquires Assets of Healthnet Enterprises
# 3                                         Esterline Selects Jason Childs as President for Control & Communication Systems Business Group
# 4                                      ASAPS 2017 San Diego: Rosemont Media CEO to Lecture on Reputation Management for Plastic Surgeons
# 5                             Experts Discuss HIV/HCV Screening Program at Homestead Hospital on National Youth HIV & AIDS Awareness Day
# 6                                                                   American Academy of Dermatology: Tips to Prevent and Treat Bug Bites

#                                                               link category1 category2 category3 category4 category5
# 1 http://www.marketwired.com/mw/release.do?id=2209199&sourceType=3                                                  
# 2 http://www.marketwired.com/mw/release.do?id=2209198&sourceType=3                                                  
# 3 http://www.marketwired.com/mw/release.do?id=2209190&sourceType=3  NYSE:ESL                                        
# 4 http://www.marketwired.com/mw/release.do?id=2209184&sourceType=3                                                  
# 5 http://www.marketwired.com/mw/release.do?id=2209183&sourceType=3                                                  
# 6 http://www.marketwired.com/mw/release.do?id=2209171&sourceType=3