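Escaping both double and single quotes in XPath

Time: 2019-12-16 21:50:38

Tags: r xpath escaping quotes rvest

Bounty: the current answers do not fully solve the problem.

Update: non-R answers that I can try to translate to R are very welcome!

Similar to How to deal with single quote in xpath, I want to escape single quotes. The difference is that I cannot rule out that double quotes also appear in the target string.

Goal:

Escape both double and single quotes in an XPath expression (in R). The target should be passed as a variable, not hardcoded as in one of the existing answers. (It has to be a variable because I do not know its content beforehand; it may contain single quotes, double quotes, or both.)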

Works:

library(rvest)
library(magrittr)
html <- "<div>1</div><div>Father's son</div>"
target <- "Father's son"
html %>% xml2::read_html() %>% html_nodes(xpath = paste0("//*[contains(text(), \"", target,"\")]"))
{xml_nodeset (1)}
[1] <div>Father's son</div>

Does not work:

html <- "<div>1</div><div>Fat\"her's son</div>"
target <- "Fat\"her's son"
html %>% xml2::read_html() %>% html_nodes(xpath = paste0("//*[contains(text(), \"", target,"\")]"))
{xml_nodeset (0)}
Warning message:
In xpath_search(x$node, x$doc, xpath = xpath, nsMap = ns, num_results = Inf) :
  Invalid expression [1207]
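
To see why the second case fails, it can help to print the expression that `paste0()` actually builds. The following is a minimal base-R sketch, not part of the original question; `n_double` is a name introduced here for illustration. The string literal closes early after `Fat`, and the rest of the expression no longer parses.

```r
# Rebuild the failing XPath string from the question and inspect it.
target <- "Fat\"her's son"
xpath <- paste0("//*[contains(text(), \"", target, "\")]")

# Show the raw expression as the XPath parser receives it.
writeLines(xpath)
#> //*[contains(text(), "Fat"her's son")]

# Count the double quotes: an odd count means one literal is left unterminated.
n_double <- lengths(regmatches(xpath, gregexpr("\"", xpath, fixed = TRUE)))
n_double %% 2 == 1
```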

4 answers:

Answer 0: (score: 6)

The key here is realising that with xml2 you can write HTML-escaped characters back into the parsed html. This function does the trick. It is longer than it needs to be because I have included comments and some type checking/conversion logic.

contains_text <- function(node_set, find_this)
{
  # Ensure we have a nodeset
  if(all(class(node_set) == c("xml_document", "xml_node")))
    node_set %<>% xml_children()

  if(class(node_set) != "xml_nodeset")
    stop("contains_text requires an xml_nodeset or xml_document.")

  # Get all leaf nodes
  node_set %<>% xml_nodes(xpath = "//*[not(*)]")

  # HTML escape the target string
  find_this %<>% {gsub("\"", "&quot;", .)}

  # Extract, HTML escape and replace the nodes
  lapply(node_set, function(node) xml_text(node) %<>% {gsub("\"", "&quot;", .)})

  # Now we can define the xpath and extract our target nodes
  xpath <- paste0("//*[contains(text(), \"", find_this, "\")]")
  new_nodes <- html_nodes(node_set, xpath = xpath)

  # Since the underlying xml_document is passed by pointer internally,
  # we should unescape any text to leave it unaltered
  xml_text(node_set) %<>% {gsub("&quot;", "\"", .)}
  return(new_nodes)
}

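Now we can do:

library(rvest)
library(xml2)

html %>% xml2::read_html() %>% contains_text(target)
#> {xml_nodeset (1)}
#> [1] <div>Fat"her's son</div>

html %>% xml2::read_html() %>% contains_text(target) %>% xml_text()
#> [1] "Fat\"her's son"

Addendum

This is an alternative method, an implementation of the approach suggested by @Alejandro that allows arbitrary targets. It has the merit of leaving the xml document untouched and is a little faster than the method above, but it involves the kind of string parsing that an xml library is supposed to prevent. It works by taking the target, splitting it after each " and ', enclosing each fragment in the opposite type of quote to the one it contains, then pasting the fragments back together with commas and inserting them into an XPath concat() function: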

library(stringr)

safe_xpath <- function(target)
{
  target                                 %<>%
  str_replace_all("\"", "&quot;&break;") %>%
  str_replace_all("'", "&apo;&break;")   %>%
  str_split("&break;")                   %>%
  unlist()

  safe_pieces    <- grep("(&quot;)|(&apo;)", target, invert = TRUE)
  contain_quotes <- grep("&quot;", target)
  contain_apo    <- grep("&apo;", target)

  if(length(safe_pieces) > 0) 
      target[safe_pieces] <- paste0("\"", target[safe_pieces], "\"")

  if(length(contain_quotes) > 0)
  {
    target[contain_quotes] <- paste0("'", target[contain_quotes], "'")
    target[contain_quotes] <- gsub("&quot;", "\"", target[contain_quotes])
  }

  if(length(contain_apo) > 0)
  {
    target[contain_apo] <- paste0("\"", target[contain_apo], "\"")
    target[contain_apo] <- gsub("&apo;", "'", target[contain_apo])
  }

  fragment <- paste0(target, collapse = ",")
  return(paste0("//*[contains(text(),concat(", fragment, "))]"))
}

For the target above, this gives:

safe_xpath(target)
#> [1] "//*[contains(text(),concat('Fat\"',\"her'\",\"s son\"))]"

Answer 1: (score: 2)

Using quote() for the XPath query

library(XML)

Only a single quote in the string

target1 <- "Father's son"
doc1 <- XML::newHTMLDoc()
newXMLNode("div", 1, parent = getNodeSet(doc1, "//body"), doc = doc1)
newXMLNode("div", target1, parent = getNodeSet(doc1, "//body"), doc = doc1)
xpath_query1 <- paste0('//*[ contains(text(), ', '"', target1, '"', ')]')
getNodeSet(doc1, xpath_query1)

Both single and double quotes in the string

target2 <- "Fat\"her's son"
doc2 <- XML::newHTMLDoc()
newXMLNode("div", 1, parent = getNodeSet(doc2, "//body"), doc = doc2)
newXMLNode("div", target2, parent = getNodeSet(doc2, "//body"), doc = doc2)
xpath_query2 <- quote('//body/*[contains(.,concat(\'Fat"\',"her\'s son"))]')
getNodeSet(doc2, xpath_query2)

Output:

getNodeSet(doc1, xpath_query1)
# [[1]]
# <div>Father's son</div> 
# 
# attr(,"class")
# [1] "XMLNodeSet"

getNodeSet(doc2, xpath_query2)
# [[1]]
# <div>Fat"her's son</div> 
# 
# attr(,"class")
# [1] "XMLNodeSet"

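Answer 2: (score: 1)

Since you are building the XPath expression with string manipulation, it is your responsibility to make sure the expression is valid XPath. The expression: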

//*[contains(.,concat('Fat"',"her's son"))]

selects:

<div>Fat"her's son</div>

Tested here.

Using XPath string variables would be a better approach, but it looks like R has no API for that, even with libxml.
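
The concat() expression above can also be generated from an arbitrary string instead of being written by hand. The following is a minimal base-R sketch under stated assumptions, not the answerer's code: `xpath_literal` is a hypothetical helper name, and it assumes a single non-NA string. It uses a plain quoted literal when only one kind of quote is present and falls back to concat() when both occur.

```r
# Hypothetical helper: turn one R string into a valid XPath 1.0 string expression.
xpath_literal <- function(s) {
  stopifnot(is.character(s), length(s) == 1L, !is.na(s))

  # No double quote inside: a double-quoted literal is safe.
  if (!grepl("\"", s, fixed = TRUE)) return(paste0("\"", s, "\""))

  # No single quote inside: a single-quoted literal is safe.
  if (!grepl("'", s, fixed = TRUE)) return(paste0("'", s, "'"))

  # Both kinds present: split into runs of double quotes and runs of
  # everything else, then quote each run with the opposite quote type.
  runs <- regmatches(s, gregexpr("[^\"]+|\"+", s))[[1]]
  is_dq <- startsWith(runs, "\"")
  quoted <- ifelse(is_dq, paste0("'", runs, "'"), paste0("\"", runs, "\""))

  paste0("concat(", paste(quoted, collapse = ","), ")")
}

target <- "Fat\"her's son"
xpath <- paste0("//*[contains(text(), ", xpath_literal(target), ")]")
writeLines(xpath)
```

The resulting string can then be passed as the `xpath` argument to `html_nodes()` as in the question. Because the concat() branch is only reached when both quote types occur, it always receives at least two arguments, which XPath's concat() requires.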

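Answer 3: (score: 0)

I wrapped the target in a cat() call inside the html_nodes() call. It seems to handle both cases. cat() also has the side effect of printing the escaped text.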

library(rvest)
library(magrittr)

html <- "<div>1</div><div>Father's son</div>"
target <- "Father's son"
html %>% xml2::read_html() %>% html_nodes(xpath = paste0("//*[contains(text(), \"",cat(target),"\")]"))
#> Father's son
#> {xml_nodeset (4)}
#> [1] <html><body>\n<div>1</div>\n<div>Father's son</div>\n</body></html>
#> [2] <body>\n<div>1</div>\n<div>Father's son</div>\n</body>
#> [3] <div>1</div>\n
#> [4] <div>Father's son</div>

html <- "<div>1</div><div>Father said \"Hello!\"</div>"
target <- 'Father said "Hello!"'
html %>% xml2::read_html() %>% html_nodes(xpath = paste0("//*[contains(text(), \"",cat(target),"\")]"))
#> Father said "Hello!"
#> {xml_nodeset (4)}
#> [1] <html><body>\n<div>1</div>\n<div>Father said "Hello!"</div>\n</body> ...
#> [2] <body>\n<div>1</div>\n<div>Father said "Hello!"</div>\n</body>
#> [3] <div>1</div>\n
#> [4] <div>Father said "Hello!"</div>
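
One thing worth checking before relying on the output above is what `cat()` contributes to the pasted string. The following is a small base-R sketch, not part of the original answer; `printed` is a name introduced here for illustration. `cat()` prints its argument and returns `NULL`, and `paste0()` drops zero-length arguments, so the predicate ends up testing for an empty string.

```r
target <- "Father's son"

# cat() writes to the console and returns NULL (invisibly).
printed <- cat(target, "\n")
is.null(printed)

# paste0() skips the zero-length NULL, so the literal between the quotes is empty.
xpath <- paste0("//*[contains(text(), \"", cat(target, "\n"), "\")]")
writeLines(xpath)
#> //*[contains(text(), "")]
```

In XPath 1.0, `contains()` returns true whenever its second argument is the empty string, which would explain why the nodesets above have four entries, including `<html>` and `<body>`, rather than only the matching `<div>`.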