How to compare two XML files in R

Time: 2017-11-22 09:53:06

Tags: r xml

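I have two complex XML files, and I want to find the differences between them.

What I need to find is:

  • Tags that exist in only one of the two XML files
  • Values that differ between the two files

I have tried compareXMLDocs from the XML package, but the results are not satisfactory.

Example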

XML1
<root>
  <first>name1</first>
  <second>id1</second>
  <third>
    <third.1>something</third.1>
    <third.2>something else</third.2>
  </third>
  <fifth>no differences</fifth>
</root>


XML2
<root>
  <second>id2</second>
  <third>
    <third.1>something2</third.1>
    <third.2>something else2</third.2>
  </third>
  <fourth>blahblah</fourth>
  <fifth>no differences</fifth>
</root>

So when I compare them with compareXMLDocs, I get:

> compareXMLDocs(a, b)
$inA
first 
    1 

$inB
fourth 
     1 

$countDiffs
named integer(0)

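This tells me that the first tag exists only in XML1 and the fourth tag only in XML2. But it does not tell me that, for example, the values under third are different, and that is what I'm looking for. I also don't understand what countDiffs does; it doesn't seem to be useful here.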

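I also tried converting the XML to a data frame, but the output format is not very helpful. For large XML files with deep trees it gets even worse.

For this example, I would like the result to be a data frame like this: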

Path                  A                B
/root/first           name1            NA
/root/second          id1              id2
/root/third/third.1   something        something2
/root/third/third.2   something else   something else2
/root/fourth          NA               blahblah

1 answer:

Answer 0 (score: 3)

Data:

library(xml2)
library(tidyverse)

read_xml("<root>
  <first>name1</first>
  <second>id1</second>
  <third>
    <third.1>something</third.1>
    <third.2>something else</third.2>
  </third>
  <fifth>no differences</fifth>
</root>
") -> d1

read_xml("
<root>
  <second>id2</second>
  <third>
    <third.1>something2</third.1>
    <third.2>something else2</third.2>
  </third>
  <fourth>blahblah</fourth>
  <fifth>no differences</fifth>
</root>
") -> d2

Make a quick helper function:

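# NOTE: this will not handle attributes
# Flatten an XML document into a two-column tibble: dotted element path (key) and text (val).
# Depending on the xml2 version, as_list() may keep the root element, adding a "root." prefix to each key.
as_path_df <- function(x) {
  as_list(x) %>%      # nested list mirroring the XML tree
    unlist() %>%      # named character vector; names are dotted paths
    as.list() %>%     # one list element per leaf
    as_tibble() %>%   # one-row tibble (as_data_frame() is deprecated)
    gather(key, val)  # reshape to long form: one row per leaf
}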

Here is what as_path_df() produces:

(d1_p <- as_path_df(d1))
## # A tibble: 5 x 2
##             key            val
##           <chr>          <chr>
## 1         first          name1
## 2        second            id1
## 3 third.third.1      something
## 4 third.third.2 something else
## 5         fifth no differences

(d2_p <- as_path_df(d2))
## # A tibble: 5 x 2
##             key             val
##           <chr>           <chr>
## 1        second             id2
## 2 third.third.1      something2
## 3 third.third.2 something else2
## 4        fourth        blahblah
## 5         fifth  no differences

Keys?

setdiff(d1_p$key, d2_p$key)
## [1] "first"

Values?

rename(d1_p, d1_val=val) %>%
  left_join(rename(d2_p, d2_val=val)) %>%
  mutate(same = (d1_val == d2_val))
## # A tibble: 5 x 4
##             key         d1_val          d2_val   same
##           <chr>          <chr>           <chr>  <lgl>
## 1         first          name1            <NA>     NA
## 2        second            id1             id2  FALSE
## 3 third.third.1      something      something2  FALSE
## 4 third.third.2 something else something else2  FALSE
## 5         fifth no differences  no differences   TRUE
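
The left_join() above keeps only keys that occur in d1, so the fourth tag never shows up. A possible variation, sketched here under the assumption that the two path tables look like the printouts above, is to use full_join() and then keep only the rows where a value is missing or differs. The column names A and B and the name diff_df are chosen to mirror the table in the question and are not part of the original answer.

```r
library(dplyr)

# Stand-ins for as_path_df(d1) and as_path_df(d2), copied from the printouts above
d1_p <- tibble(
  key = c("first", "second", "third.third.1", "third.third.2", "fifth"),
  val = c("name1", "id1", "something", "something else", "no differences")
)
d2_p <- tibble(
  key = c("second", "third.third.1", "third.third.2", "fourth", "fifth"),
  val = c("id2", "something2", "something else2", "blahblah", "no differences")
)

diff_df <- full_join(
  rename(d1_p, A = val),
  rename(d2_p, B = val),
  by = "key"                               # keep keys from both sides
) %>%
  filter(is.na(A) | is.na(B) | A != B)     # drop rows where both values agree
```

Rows with NA in A or B are tags present in only one file; the remaining rows are tags whose values differ.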

You could probably just use is.na() on the _val columns or the same column in place of setdiff() for the missing-keys part. But setdiff() is super fast.
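
If the slash-separated Path column from the question is wanted, one option is to take the paths from xml2 directly rather than from unlist(). The following is only a sketch under that assumption: leaf_paths() is a hypothetical helper, not part of the answer above, and like as_path_df() it ignores attributes. It selects every element that has no child elements, then pairs xml_path() with xml_text().

```r
library(xml2)

# Hypothetical helper: one row per leaf element (an element with no child elements)
leaf_paths <- function(doc) {
  leaves <- xml_find_all(doc, "//*[not(*)]")
  data.frame(
    Path = xml_path(leaves),   # e.g. "/root/third/third.1"
    val  = xml_text(leaves),
    stringsAsFactors = FALSE
  )
}

d1 <- read_xml("<root><first>name1</first><second>id1</second><third><third.1>something</third.1><third.2>something else</third.2></third><fifth>no differences</fifth></root>")
d2 <- read_xml("<root><second>id2</second><third><third.1>something2</third.1><third.2>something else2</third.2></third><fourth>blahblah</fourth><fifth>no differences</fifth></root>")

# Outer join on Path so that tags present in only one document are kept
merged <- merge(leaf_paths(d1), leaf_paths(d2),
                by = "Path", all = TRUE, suffixes = c(".A", ".B"))
names(merged) <- c("Path", "A", "B")

# Keep rows where a value is missing on one side or the two values differ
differs <- is.na(merged$A) | is.na(merged$B) | merged$A != merged$B
result  <- merged[differs, ]
```

merge() sorts by Path, so the row order will differ from the table in the question. When sibling elements share a name, xml_path() adds positional indices such as [1] and [2], so rows are then matched by position.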