From XML attributes to a data.frame in R

Time: 2015-10-01 21:06:02

Tags: xml r dataframe

I have an XML file containing data like this:

<?xml version="1.0" encoding="utf-8"?>
<posts>
  <row Id="1" PostTypeId="1" 
       AcceptedAnswerId="15" CreationDate="2010-07-19T19:12:12.510" Score="27" 
       ViewCount="1647" Body="some text;" OwnerUserId="8" 
       LastActivityDate="2010-09-15T21:08:26.077" 
       Title="title" AnswerCount="5" CommentCount="1" FavoriteCount="17" />
[...]

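(The dataset is a dump from stats.stackexchange.com.)

How can I get a data.frame with the attributes "Id" and "PostTypeId"?

I have been trying the XML library, but I have reached a point where I don't know how to unwrap the values: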

library(XML)

xml <- xmlTreeParse("Posts.xml",useInternalNode=TRUE)
types <- getNodeSet(xml, '//row/@PostTypeId')

> types[1]
[[1]]
PostTypeId 
       "1" 
attr(,"class")
[1] "XMLAttributeValue"

What is the proper R way to project the XML into a data.frame with these two columns?

3 answers:

Answer 0: (score: 4)

Using rvest (a wrapper around xml2), you can do this as follows:

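library(rvest)
library(xml2)      # read_xml(), xml_find_all() and xml_attr() come from xml2
library(magrittr)  # provides %>%

# Older rvest releases exposed this as xml(); read_xml() is the xml2 function
doc <- read_xml('<posts>
  <row Id="1" PostTypeId="1" 
AcceptedAnswerId="15" CreationDate="2010-07-19T19:12:12.510" Score="27" 
ViewCount="1647" Body="some text;" OwnerUserId="8" 
LastActivityDate="2010-09-15T21:08:26.077" 
Title="title" AnswerCount="5" CommentCount="1" FavoriteCount="17" />
</posts>')

# Select every <row> element, then pull one attribute per column.
# XML attribute names are case-sensitive, so use "Id", not "id".
rows <- doc %>% xml_find_all("//row")
data.frame(
  Id = rows %>% xml_attr("Id"),
  PostTypeId = rows %>% xml_attr("PostTypeId")
)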

This results in:

  Id PostTypeId
1  1          1

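If you take Comments.xml instead:

# Attribute names are case-sensitive: "Id", "PostId", "Score"
dat <- data.frame(
  Id = rows %>% xml_attr("Id"),
  PostTypeId = rows %>% xml_attr("PostId"),  # this column holds PostId here
  score = rows %>% xml_attr("Score")
)

you get: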

> head(dat)
  Id PostTypeId score
1  1          3     5
2  2          5     0
3  3          9     0
4  4          5    11
5  5          3     1
6  6         14     9
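
The attribute lookup above depends on exact attribute names. As a minimal sketch (not part of the original answer), the snippet below uses a small inline sample standing in for Posts.xml and a hypothetical `wanted` vector to show that `xml_attr()` returns `NA` for a wrongly cased name, and how several attributes can be collected into a data.frame in one pass. It assumes only the xml2 package.

```r
library(xml2)

# Inline sample standing in for Posts.xml (assumption: same <row> layout)
doc <- read_xml('<posts>
  <row Id="1" PostTypeId="1" Score="27" />
  <row Id="2" PostTypeId="2" Score="3" />
</posts>')

rows <- xml_find_all(doc, "//row")

# Attribute names are case-sensitive in XML
lower_case <- xml_attr(rows, "id")   # no such attribute: NA for every row
exact_case <- xml_attr(rows, "Id")   # the values, as character strings

# Hypothetical helper vector: the attributes to turn into columns
wanted <- c("Id", "PostTypeId")

# One character vector per attribute, named after the attribute
cols <- lapply(wanted, function(a) xml_attr(rows, a))
names(cols) <- wanted

posts <- as.data.frame(cols, stringsAsFactors = FALSE)
print(posts)
```

All columns come back as character vectors; convert them with `as.integer()` or similar if numeric columns are needed.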

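Answer 1: (score: 3)

This is actually a great use case for the xmlEventParse function in the XML package. This is a 200+ MB file, and the last thing you want to do is waste memory needlessly (XML parsing is notoriously memory-intensive) or waste time walking the nodes multiple times.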

Using xmlEventParse you can also filter what you do or don't need, and you can get a progress bar in there so you can see what's going on.

library(XML)
library(data.table)

# get the # of <rows> quickly; you can approximate if you don't know the
# number or can't run this and then chop down the size of the data.frame
# afterwards
system("grep -c '<row' ~/Desktop/p1.xml")
## 128010

n <- 128010

# pre-populate a data.frame
# you could also just write this data out to a file and read it back in
# which would negate the need to use global variables or pre-allocate
# a data.frame
dat <- data.frame(id=rep(NA_character_, n),
                  post_type_id=rep(NA_character_, n),
                  stringsAsFactors=FALSE)

# setup a progress bar since there are a lot of nodes
pb <- txtProgressBar(min=0, max=n, style=3)

# this function will be called for each <row>
# again, you could write to a file/database/whatever vs do this
# data.frame population
idx <- 1
process_row <- function(node, tribs) {
  # update the progress bar
  setTxtProgressBar(pb, idx)
  # get our data (you can filter here)
  dat[idx, "id"] <<- tribs["Id"]
  dat[idx, "post_type_id"] <<- tribs["PostTypeId"]
  # update the index
  idx <<- idx + 1
}

# start the parser
info <- xmlEventParse("Posts.xml", list(row=process_row))

# close up the progress bar
close(pb)

head(dat)
##   id post_type_id
## 1  1            1
## 2  2            1
## 3  3            1
## 4  4            1
## 5  5            2
## 6  6            1
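
The answer mentions that global variables can be avoided. One way to do that, shown here as a rough sketch rather than the answerer's method, is to keep the collected values inside a closure and register a generic `startElement` handler. The names `make_collector`, `collector` and `sample_xml` are hypothetical, and the inline sample stands in for Posts.xml. Growing vectors one element at a time is slow for very large files, so pre-allocation as in the answer remains the better choice at scale.

```r
library(XML)

# Inline sample standing in for Posts.xml (assumption: same <row> layout)
sample_xml <- '<posts>
  <row Id="1" PostTypeId="1" Score="27" />
  <row Id="2" PostTypeId="2" Score="3" />
</posts>'

# Build a handler whose state lives in the enclosing environment,
# so nothing is written to the global environment
make_collector <- function() {
  ids <- character()
  types <- character()
  list(
    # called by the SAX parser for every opening tag
    startElement = function(name, attrs, ...) {
      if (name == "row") {
        ids[length(ids) + 1L] <<- attrs[["Id"]]
        types[length(types) + 1L] <<- attrs[["PostTypeId"]]
      }
    },
    # returns what has been collected so far
    result = function() {
      data.frame(id = ids, post_type_id = types, stringsAsFactors = FALSE)
    }
  )
}

collector <- make_collector()

# asText = TRUE parses the string itself instead of treating it as a file name
invisible(xmlEventParse(sample_xml,
                        handlers = collector["startElement"],
                        asText = TRUE))

posts <- collector$result()
print(posts)
```

For a real file, the first argument would be the file path and `asText` would be left at its default.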


Answer 2: (score: 0)

A bit easier than the other answers:

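library(xml2)

doc <- read_xml("Posts.xml")
rows <- xml_children(doc)   # every <row> under the root <posts>

# xml_attr() returns character vectors; convert them to numeric
df <- data.frame(
  Id = as.numeric(xml_attr(rows, "Id")),
  PostTypeId = as.numeric(xml_attr(rows, "PostTypeId"))
)

  1. No rvest / magrittr packages, only xml2
  2. Converts strings containing numbers to numeric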
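
To illustrate the conversion step, here is a small sketch under stated assumptions: an inline sample instead of Posts.xml, and a second row that lacks `AcceptedAnswerId` (a made-up case for illustration). It uses `as.integer()` instead of `as.numeric()` because the IDs are whole numbers. A missing attribute becomes `NA` after conversion.

```r
library(xml2)

# Inline sample (assumption): the second row has no AcceptedAnswerId
doc <- read_xml('<posts>
  <row Id="1" PostTypeId="1" AcceptedAnswerId="15" />
  <row Id="2" PostTypeId="2" />
</posts>')

rows <- xml_children(doc)

# xml_attr() yields NA_character_ for a missing attribute,
# and as.integer() turns that into NA_integer_
df <- data.frame(
  Id = as.integer(xml_attr(rows, "Id")),
  PostTypeId = as.integer(xml_attr(rows, "PostTypeId")),
  AcceptedAnswerId = as.integer(xml_attr(rows, "AcceptedAnswerId"))
)

str(df)
```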