Calling two different functions on the same object in a pipe (%>%)

Asked: 2019-03-26 19:01:20

Tags: r magrittr

I would like to know whether there is a way to call both html_name() and html_text() (from the rvest package) on the same object within a single magrittr::%>% pipe.

Here is an example:

uniprot_ac <- "P31374"

GET(paste0("https://www.uniprot.org/uniprot/", uniprot_ac, ".xml")) %>%
  content(as = "raw", content = "text/xml") %>%
  read_html %>%
  html_nodes(xpath = '//recommendedname/* | //name[@type="primary"] | //comment[@type="function"]/text | //comment[@type="interaction"]/text')

At this point I want to get both the tag names from html_name()

[1] "fullname" "ecnumber" "name" "text"

and the tag contents from html_text(), without having to create a separate object by rewriting the whole pipe just to change the last line.

The desired output could look something like this; either a vector or a data.frame would be fine.

[1] "Serine/threonine-protein kinase PSK1"                                                                                                                                                                                                                                                                             
[2] "2.7.11.1"                                                                                                                                                                                                                                                                                                         
[3] "PSK1"                                                                                                                                                                                                                                                                                                             
[4] "Serine/threonine-protein kinase involved ... ... 

5 answers:

Answer 0 (score: 5)

It may be a bit of a hack, but you can use an anonymous function wrapped in parentheses inside the pipe:

library("magrittr")
library("httr")
library("xml2")
library("rvest")

uniprot_ac <- "P31374"

GET(paste0("https://www.uniprot.org/uniprot/", uniprot_ac, ".xml")) %>%
  content(as = "raw", content = "text/xml") %>%
  read_html %>%
  html_nodes(xpath = '//recommendedname/* |
             //name[@type="primary"] | //comment[@type="function"]/text |
             //comment[@type="interaction"]/text') %>% 
  (function(x) list(name = html_name(x), text = html_text(x)))
#$name
#[1] "fullname" "ecnumber" "name"     "text"    
#
#$text
#[1] "Serine/threonine-protein kinase PSK1"                                                                                                                                                                                                                                                                             
#[2] "2.7.11.1"                                                                                                                                                                                                                                                                                                         
#[3] "PSK1"                                                                                                                                                                                                                                                                                                             
#[4] "Serine/threonine-protein kinase involved in the control of sugar metabolism and translation. Phosphorylates UGP1, which is required for normal glycogen and beta-(1,6)-glucan synthesis. This phosphorylation shifts glucose partitioning toward cell wall glucan synthesis at the expense of glycogen synthesis."

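Alternatively, you could do something more elegant with the purrr package, but I see no reason to load a whole package just for this.

Edit: As @MrFlick pointed out in the comments, the dot (.) placeholder can do the same thing, provided it is placed inside curly braces.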

GET(paste0("https://www.uniprot.org/uniprot/", uniprot_ac, ".xml")) %>%
  content(as = "raw", content = "text/xml") %>%
  read_html %>%
  html_nodes(xpath = '//recommendedname/* |
             //name[@type="primary"] | //comment[@type="function"]/text |
             //comment[@type="interaction"]/text') %>% 
  {list(name = html_name(.), text = html_text(.))}

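This is arguably the more standard idiom, and it is in fact documented in help("%>%").

Answer 1 (score: 4)

You can create a custom function that takes the html_nodes object and does whatever you need with it: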
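The braces matter because of where magrittr places the left-hand side. The following is a minimal sketch of that behaviour, with base functions standing in for html_name() and html_text(); the toy vector x and the names total and largest are assumptions made for illustration, not part of the question.

```r
library(magrittr)

x <- c(3, 1, 2)  # toy stand-in for the node set (assumption)

# Without braces the dot only appears inside nested calls, so %>% still
# inserts x as the first argument of list(): the result has three elements.
no_braces <- x %>% list(total = sum(.), largest = max(.))

# With braces nothing is inserted; the dot is simply evaluated in place,
# so the result has exactly the two named elements.
with_braces <- x %>% {list(total = sum(.), largest = max(.))}

length(no_braces)
length(with_braces)
names(with_braces)
```

The same rule explains why the parenthesised anonymous function works: it receives the piped value as its only argument and nothing else is added.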

html_name_text <- function(nodes) {
    list(html_name(nodes), html_text(nodes))
}

GET(paste0("https://www.uniprot.org/uniprot/", uniprot_ac, ".xml")) %>%
    content(as = "raw", content = "text/xml") %>%
    read_html %>%
    html_nodes(xpath = '//recommendedname/* |
               //name[@type="primary"] | //comment[@type="function"]/text |
               //comment[@type="interaction"]/text') %>%
    html_name_text()

[[1]]
[1] "fullname" "ecnumber" "name"     "text"    

[[2]]
[1] "Serine/threonine-protein kinase PSK1"                                                                                                                                                                                                                                                                             
[2] "2.7.11.1"                                                                                                                                                                                                                                                                                                         
[3] "PSK1"                                                                                                                                                                                                                                                                                                             
[4] "Serine/threonine-protein kinase involved in the control of sugar metabolism and translation. Phosphorylates UGP1, which is required for normal glycogen and beta-(1,6)-glucan synthesis. This phosphorylation shifts glucose partitioning toward cell wall glucan synthesis at the expense of glycogen synthesis."
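Since the question says a data.frame would also be acceptable, the helper could return one instead of an unnamed list. This is only a sketch: html_name_text_df is a hypothetical name, and a small inline HTML string is used in place of the UniProt download so that it runs offline.

```r
library(magrittr)
library(xml2)
library(rvest)

# Hypothetical helper: one row per node, tag name next to its text.
html_name_text_df <- function(nodes) {
  data.frame(
    name = html_name(nodes),
    text = html_text(nodes),
    stringsAsFactors = FALSE
  )
}

# Small inline document used instead of the UniProt XML (assumption).
doc <- read_html("<html><body><h1>PSK1</h1><p>A kinase.</p></body></html>")

result <- doc %>%
  html_nodes(xpath = "//h1 | //p") %>%
  html_name_text_df()

result
```

Because html_name() and html_text() both return one value per node, the two columns line up row by row.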

Answer 2 (score: 4)

Here is a purrr approach that returns a tibble:

library(tidyverse)
library(rvest)

uniprot_ac <- "P31374"
read_html(paste0("https://www.uniprot.org/uniprot/", uniprot_ac, ".xml")) %>%
  html_nodes(xpath = '//recommendedname/* |
               //name[@type="primary"] | //comment[@type="function"]/text |
               //comment[@type="interaction"]/text') %>% 
  map(~ list(name = html_name(.), text = html_text(.))) %>%
  bind_rows
#> # A tibble: 4 x 2
#>   name     text                                                            
#>   <chr>    <chr>                                                           
#> 1 fullname Serine/threonine-protein kinase PSK1                            
#> 2 ecnumber 2.7.11.1                                                        
#> 3 name     PSK1                                                            
#> 4 text     Serine/threonine-protein kinase involved in the control of suga~

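Created on 2019-03-26 by the reprex package (v0.2.1)

Answer 3 (score: 3)

One option is to use curly braces after the pipe, store the current result in a temporary object (if needed), and then compute the different results you want: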

GET(paste0("https://www.uniprot.org/uniprot/", uniprot_ac, ".xml")) %>%
    content(as = "raw", content = "text/xml") %>%
    read_html %>%
    html_nodes(xpath = '//recommendedname/* |
               //name[@type="primary"] | //comment[@type="function"]/text |
               //comment[@type="interaction"]/text') %>% {
    list(name = html_name(.), text = html_text(.))
    }

FYI, sometimes you do need to go through a temporary object, as in this example:

iris %>% 
  select(Sepal.Length, Sepal.Width) %>% {
     temp <- .
     bind_rows(temp %>% filter(Sepal.Length > 5), 
               temp %>% filter(Sepal.Width <= 3))
} %>% 
  dim()

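In this case it would not work if you replaced temp directly with the dot (.).

Answer 4 (score: 3)

You can do this without any extra packages and without curly braces; the dot is only passed as an ordinary argument: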
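A likely reason is that magrittr treats an expression that starts with a bare dot as the definition of a function rather than as a pipe on the current value. The sketch below illustrates this with base functions only; the vector x is a toy stand-in, and the parenthesised dot in the last step is a commonly used workaround shown here as an assumption rather than something taken from this answer.

```r
library(magrittr)

x <- 1:10  # toy stand-in for the data (assumption)

# An expression that starts with the bare dot defines a function
# instead of piping the current value.
is.function(. %>% head(2))

# Binding the value to a name first makes the inner pipes behave as usual.
via_temp <- x %>% {
  temp <- .
  c(temp %>% head(2), temp %>% tail(2))
}

# Wrapping the dot in parentheses is another way to pipe the value itself.
via_parens <- x %>% {
  c((.) %>% head(2), (.) %>% tail(2))
}

via_temp
via_parens
```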

nodes %>% lapply(list(html_name, html_text), function(x,y) x(y), .)
# [[1]]
# [1] "fullname" "ecnumber" "name"     "text"    
# 
# [[2]]
# [1] "Serine/threonine-protein kinase PSK1"                                                                                                                                                                                                                                                                             
# [2] "2.7.11.1"                                                                                                                                                                                                                                                                                                         
# [3] "PSK1"                                                                                                                                                                                                                                                                                                             
# [4] "Serine/threonine-protein kinase involved in the control of sugar 

Or the following, which is slightly more compact but uses curly braces:

nodes %>% {lapply(list(html_name, html_text), do.call, list(.))}

With purrr, I would loop over the functions and hand each one to exec, with the dot (.) passed along as its argument:

library(purrr)
nodes %>% map(list(html_name, html_text), exec, .)

(same output)

Data
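A small variation on the lapply() version: naming the functions in the list labels the results, which may be easier to read than [[1]] and [[2]]. This is a sketch with base functions standing in for html_name and html_text; x, total and largest are illustrative names only.

```r
library(magrittr)

x <- c(3, 1, 2)  # toy stand-in for `nodes` (assumption)

# lapply() keeps the names of the list it iterates over, so naming the
# functions labels the results. The dot is a direct argument here, so
# %>% does not also insert x as the first argument.
res <- x %>% lapply(list(total = sum, largest = max), function(f, y) f(y), .)

names(res)
res$total
res$largest
```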

library("magrittr")
library("httr")
library("xml2")
library("rvest")
uniprot_ac <- "P31374"
nodes <- GET(paste0("https://www.uniprot.org/uniprot/", uniprot_ac, ".xml")) %>%
  content(as = "raw", content = "text/xml") %>%
  read_html %>%
  html_nodes(xpath = '//recommendedname/* |
             //name[@type="primary"] | //comment[@type="function"]/text |
             //comment[@type="interaction"]/text')