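I'm using the xml2 package to convert an XML file to CSV. The XML I'm working with has the following structure. Note that <businessAddress> appears in only two of the three <business> nodes.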
<businesses>
  <business>
    <businessName>...</businessName>
    <businessAddress>...</businessAddress>
    <businessPostcode>...</businessPostcode>
  </business>
  <business>
    <businessName>...</businessName>
    <businessAddress>...</businessAddress>
    <businessPostcode>...</businessPostcode>
  </business>
  <business>
    <businessName>...</businessName>
    <businessPostcode>...</businessPostcode>
  </business>
</businesses>
My R code looks like this:
data <- read_xml("/path/to/the/xml")
businessName_nodes <- xml_find_all(data, "//businessName")
businessName <- xml_text(businessName_nodes)
businessAddress_nodes <- xml_find_all(data, "//businessAddress")
businessAddress <- xml_text(businessAddress_nodes)
businessPostcode_nodes <- xml_find_all(data, "//businessPostcode")
businessPostcode <- xml_text(businessPostcode_nodes)
framedData = data.frame(
  businessName,
  businessAddress,
  businessPostcode,
  stringsAsFactors = FALSE)
write.csv(framedData, file = csvName)
This gives me an "Error in data.frame... arguments imply differing number of rows" error, because not every <business> contains a <businessAddress>.
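A quick check of the vector lengths makes the mismatch visible. This is only a minimal sketch: the inline document and its values (A, B, C and so on) are hypothetical stand-ins that mirror the structure above, not the real file.

```r
library(xml2)

# Hypothetical inline document with the same shape as the real XML:
# three <business> nodes, only two of which have a <businessAddress>
doc <- read_xml("<businesses>
  <business><businessName>A</businessName><businessAddress>1 Some St</businessAddress><businessPostcode>P1</businessPostcode></business>
  <business><businessName>B</businessName><businessAddress>2 Some St</businessAddress><businessPostcode>P2</businessPostcode></business>
  <business><businessName>C</businessName><businessPostcode>P3</businessPostcode></business>
</businesses>")

# Document-wide searches return only the nodes that exist,
# so each column ends up with its own length
n_names     <- length(xml_find_all(doc, "//businessName"))
n_addresses <- length(xml_find_all(doc, "//businessAddress"))
n_postcodes <- length(xml_find_all(doc, "//businessPostcode"))

c(names = n_names, addresses = n_addresses, postcodes = n_postcodes)
```

With three names, two addresses and three postcodes, data.frame() has no way to tell which business the missing address belongs to, hence the error.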
My question is how to get around this, so that I end up with a data frame in which an empty value is created when <businessAddress> is absent:
"", "businessName", "businessAddress", "businessPostcode"
9123, "Bob Smith", NA, "M1R 0E9"
Otherwise, that row of the data frame is not created at all.
I'm new to R, so thanks for any help.
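Answer 0 (score: 1)
You can use xml_find_first from the xml2 package. It works through the business nodes one at a time, and if no XPath match is found for a node, the result is NA.
Sample data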
xmlText <- "<businesses>
  <business>
    <businessName>...</businessName>
    <businessAddress>...</businessAddress>
    <businessPostcode>...</businessPostcode>
  </business>
  <business>
    <businessName>...</businessName>
    <businessAddress>...</businessAddress>
    <businessPostcode>...</businessPostcode>
  </business>
  <business>
    <businessName>...</businessName>
    <businessPostcode>...</businessPostcode>
  </business>
</businesses>"
Code
library( xml2 )
library( magrittr ) #for the pipe symbol
doc <- read_xml( xmlText )
business_nodes <- xml_find_all( doc, ".//business" )
data.frame(
  businessName = xml_find_first( business_nodes, ".//businessName" ) %>% xml_text(),
  businessAddress = xml_find_first( business_nodes, ".//businessAddress" ) %>% xml_text(),
  businessPostcode = xml_find_first( business_nodes, ".//businessPostcode" ) %>% xml_text(),
  stringsAsFactors = FALSE )
#   businessName businessAddress businessPostcode
# 1          ...             ...              ...
# 2          ...             ...              ...
# 3          ...            <NA>              ...
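To get from there to the CSV the question asks for, the same idea can be wrapped in a loop over the field names and passed to write.csv. The following is a sketch under assumptions rather than a definitive implementation: the sample values, the fields vector and the temporary output path are all hypothetical, and it assumes each field is a direct child of <business>.

```r
library(xml2)

# Hypothetical sample values; the structure mirrors the sample data above
doc <- read_xml("<businesses>
  <business><businessName>A</businessName><businessAddress>1 Some St</businessAddress><businessPostcode>P1</businessPostcode></business>
  <business><businessName>B</businessName><businessAddress>2 Some St</businessAddress><businessPostcode>P2</businessPostcode></business>
  <business><businessName>C</businessName><businessPostcode>P3</businessPostcode></business>
</businesses>")
business_nodes <- xml_find_all(doc, ".//business")

# One column per child element. xml_find_first() returns exactly one
# result per <business>, so a missing child becomes NA after xml_text()
fields <- c("businessName", "businessAddress", "businessPostcode")
columns <- lapply(fields, function(field) {
  xml_text(xml_find_first(business_nodes, paste0("./", field)))
})
framedData <- as.data.frame(setNames(columns, fields), stringsAsFactors = FALSE)

# Hypothetical output path; replace with the real csvName
csvName <- tempfile(fileext = ".csv")
write.csv(framedData, file = csvName)
```

By default write.csv writes missing values as an unquoted NA, which matches the row layout shown in the question. Listing the fields in a vector keeps every column the same length as business_nodes, so adding another optional element only means adding its name to fields.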