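How to extract keywords from Google search results page URLs?

Asked: 2015-05-29 13:15:55

Tags: regex r url

One of the variables in my dataset contains URLs of Google search results pages. I want to extract the search keywords from these URLs.

Example dataset: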

keyw <- structure(list(user = structure(c(1L, 1L, 1L, 2L, 2L, 2L), .Label = c("p1", "p2"), class = "factor"),
                   url = structure(c(3L, 5L, 4L, 1L, 2L, 6L), .Label = c("https://www.google.nl/search?q=five+fingers&ie=utf-8&oe=utf-8&gws_rd=cr,ssl&ei=kERoVbmMO6fp7AaGioCYAw", "https://www.google.nl/search?q=five+fingers&ie=utf-8&oe=utf-8&gws_rd=cr,ssl&ei=kERoVbmMO6fp7AaGioCYAw#safe=off&q=five+short+fingers+", "https://www.google.nl/search?q=high+five&ie=utf-8&oe=utf-8&gws_rd=cr,ssl&ei=bENoVZSqL4ON7Qb5wIDIDg", "https://www.google.nl/search?q=high+five&ie=utf-8&oe=utf-8&gws_rd=cr,ssl&ei=bENoVZSqL4ON7Qb5wIDIDg#safe=off&q=high+five+with+a+chair", "https://www.google.nl/search?q=high+five&ie=utf-8&oe=utf-8&gws_rd=cr,ssl&ei=bENoVZSqL4ON7Qb5wIDIDg#safe=off&q=high+five+with+handshake", "https://www.youtube.com/watch?v=6HOallAdtDI"), class = "factor")), 
              .Names = c("user", "url"), class = "data.frame", row.names = c(NA, -6L))

So far, I have been able to extract the search-keyword parts from the URLs with:

library(stringr)  # provides str_extract_all
keyw$words <- sapply(str_extract_all(keyw$url, 'q=([^&#]*)'),paste, collapse=",")

However, this still does not give me the result I want. The code above produces:

> keyw$words
[1] "q=high+five"                           
[2] "q=high+five,q=high+five+with+handshake"
[3] "q=high+five,q=high+five+with+a+chair"  
[4] "q=five+fingers"                        
[5] "q=five+fingers,q=five+short+fingers+"  
[6] ""                                      

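This output has three problems:

  1. I only need the words as a string. Instead of q=high+five, I need high,five.
  2. As rows 2, 3 and 5 show, the URL sometimes contains two parts with search keywords. Since the first part is merely a reference to the previous search, I only need the second search query.
  3. If the URL is not a Google search page URL, it should return NA.

The desired result should be: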

    > keyw$words
    [1] "high,five"                           
    [2] "high,five,with,handshake"
    [3] "high,five,with,a,chair"  
    [4] "five,fingers"                        
    [5] "five,short,fingers"
    [6] NA
    

How can I solve this?

5 answers:

Answer 0 (score: 11)

Another update after the comments (it looks overly complex, but it is the best I can do for now :)):

keyw$words <- sapply(str_extract_all(str_extract(keyw$url,"https?:[/]{2}[^/]*google.*[/].*"),'(?<=q=|[+])([^$+#&]+)(?!.*q=)'),function(x) if(!length(x)) NA else paste(x,collapse=","))
> keyw$words
[1] "high,five"                "high,five,with,handshake" "high,five,with,a,chair"   "five,fingers"            
[5] "five,short,fingers"       NA             

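The change is in the input to str_extract_all: instead of the full vector, it now receives the result of a str_extract "filter" against a regex. Any regex can go there to match what you want more or less precisely.

The regex here is:

  • http matches http literally
  • s? matches zero or one s
  • [/]{2} matches exactly two slashes (using a character class avoids the ugly \\/ construct and keeps things more readable)
  • [^/]* matches any number of non-slash characters
  • google.*[/] matches google literally, then anything up to the last /
  • .* finally matches whatever follows the last slash, possibly nothing

Replace the final * with + to make sure parameters are present (+ requires the preceding character to occur at least once).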
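As a minimal sketch of how this filter behaves, the snippet below tests it with base R's grepl rather than stringr. The three sample URLs and the variable names are illustrative assumptions, not part of the answer itself.

```r
# Illustrative inputs: a search URL, a bare Google host, and a non-Google URL
urls <- c("https://www.google.nl/search?q=high+five",
          "https://www.google.nl/",
          "https://www.youtube.com/watch?v=6HOallAdtDI")

# Filter ending in .* : anything (or nothing) may follow the last slash
filter_star <- "https?:[/]{2}[^/]*google.*[/].*"
# Filter ending in .+ : at least one character must follow the last slash
filter_plus <- "https?:[/]{2}[^/]*google.*[/].+"

is_google_star <- grepl(filter_star, urls)  # TRUE  TRUE  FALSE
is_google_plus <- grepl(filter_plus, urls)  # TRUE  FALSE FALSE
```

The only difference between the two variants is the bare host URL, which the + version rejects because nothing follows its last slash.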

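Update inspired by @BrodieG: this returns NA when there is no match, but it will still match any site that has q= in its parameters.

Still the same approach: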

> keyw$words <- sapply(str_extract_all(keyw$url,'(?:(?<=q=|\\+)([^$+#&]+)(?!.*q=))'),function(x) if(!length(x)) NA else paste(x,collapse=","))
> keyw$words
[1] "high,five"                "high,five,with,handshake" "high,five,with,a,chair"  
[4] "five,fingers"             "five,short,fingers"       NA         


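The lookbehind (?<=) ensures that q= or + comes right before the word, and the negative lookahead (?!) ensures that no q= follows anywhere before the end of the line.

The character class excludes the + sign, so each match stops at the end of a word.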
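To see the lookbehind and lookahead at work on a single URL, here is a small sketch that uses base R's gregexpr with perl = TRUE in place of stringr (an assumption made to keep the example dependency-free; stringr uses the ICU engine, which should behave the same for this pattern). The shortened URL is illustrative.

```r
# Illustrative URL with an earlier query (q=high+five) and a later one after '#'
url <- "https://www.google.nl/search?q=high+five&ie=utf-8#safe=off&q=high+five+with+a+chair"

# Words preceded by "q=" or "+" and not followed by another "q=" later on
pat <- "(?<=q=|\\+)[^$+#&]+(?!.*q=)"

words <- regmatches(url, gregexpr(pat, url, perl = TRUE))[[1]]
result <- paste(words, collapse = ",")
result  # "high,five,with,a,chair"
```

The words of the first query are rejected by the negative lookahead because a second q= still follows them, so only the last query survives.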

Answer 1 (score: 8)

Perhaps this:

gsub("\\+", ",", gsub(".*q=([^&#]*[^+&]).*", "\\1", keyw$url))
# [1] "high,five"                "high,five,with,handshake" "high,five,with,a,chair"  
# [4] "five,fingers"             "five,short,fingers"  
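Note that gsub returns its input unchanged when the pattern does not match, so a non-Google URL passes through as-is rather than becoming NA. A hedged sketch of one way to add the NA requirement from the question follows; the helper name extract_keywords and the host check are assumptions for illustration, not part of this answer.

```r
# Hypothetical helper: same nested gsub as above, plus an NA for non-Google URLs
extract_keywords <- function(url) {
  url <- as.character(url)
  # Assumed check: host contains "google." and a q= parameter appears after it
  is_search <- grepl("^https?://[^/]*google\\.[^/]+/.*q=", url)
  out <- gsub("\\+", ",", gsub(".*q=([^&#]*[^+&]).*", "\\1", url))
  out[!is_search] <- NA_character_
  out
}

res <- extract_keywords(c(
  "https://www.google.nl/search?q=five+fingers&ie=utf-8",
  "https://www.google.nl/search?q=five+fingers&ie=utf-8#safe=off&q=five+short+fingers+",
  "https://www.youtube.com/watch?v=6HOallAdtDI"
))
res  # "five,fingers" "five,short,fingers" NA
```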

Answer 2 (score: 5)

Update (borrowing part of the regex from David):

dat <- as.character(keyw$url)
pat <- "^https://www\\.google\\.nl/.*\\bq=([^&]*[^&+]).*"
sapply(
  regmatches(dat, regexec(pat, dat)),
  function(x) if(!length(x)) NA else gsub("\\+", ",", x[[2]])
)

Produces:

[1] "high,five"                "high,five,with,handshake" "high,five,with,a,chair"  
[4] "five,fingers"             "five,short,fingers"       NA   

Using:

pat <- "^https://www\\.google.(?:com?.)?[a-z]{2,3}/.*\\b?q=([^&]*[^&+]).*"

takes all country-specific Google domains into account.

Or:

gsub("\\+", ",", sub("^.*\\bq=([^&]*).*", "\\1", keyw$url))

Produces:

[1] "high,five"                "high,five,with,handshake" "high,five,with,a,chair"  
[4] "five,fingers"             "five,short,fingers,"     

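Here we rely on greediness to make sure we skip everything up to the last q=... part, then use the standard sub / \\1 trick to capture what we want. Finally, + is replaced with a comma.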
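The greedy step can be sketched on one URL as below. The extra sub call that strips a trailing + before the replacement is an assumption added here to avoid the dangling comma seen in the last element of the output above; the shortened URL is illustrative.

```r
# Illustrative URL whose last query ends in a trailing '+'
url <- "https://www.google.nl/search?q=five+fingers&ie=utf-8#safe=off&q=five+short+fingers+"

# Greedy ^.* consumes as much as possible, so \\1 holds the LAST q= value
last_query <- sub("^.*\\bq=([^&]*).*", "\\1", url)
last_query  # "five+short+fingers+"

# Assumed extra step: drop trailing '+' signs, then turn the rest into commas
keywords <- gsub("\\+", ",", sub("\\++$", "", last_query))
keywords    # "five,short,fingers"
```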

Answer 3 (score: 3)

There must be a cleaner way, but maybe something like this:

sapply(strsplit(keyw$words, "q="), function(x) {
  x <- if (length(x) == 2) x[2] else x[3]
  gsub("+", ",", gsub("\\+$", "", x), fixed = TRUE)
})
# [1] "high,five"                "high,five,with,handshake" "high,five,with,a,chair"  
# [4] "five,fingers"             "five,short,fingers" 

All together:

keyw$words <- sapply(str_extract_all(keyw$url, 'q=([^&#]*)'),function(x) {
  x <- if (length(x) == 2) x[2] else x[1]
  x <- gsub("+", ",", gsub("\\+$", "", x), fixed = TRUE)
  gsub("q=","",x, fixed = TRUE)
})

Answer 4 (score: 3)

I would try:

x<-as.character(keyw$url)
vapply(regmatches(x,gregexpr("(?<=q=)[^&]+",x,perl=TRUE)),
       function(y) paste(unique(unlist(strsplit(y,"\\+"))),collapse=","),"")
#[1] "high,five"                "high,five,with,handshake"
#[3] "high,five,with,a,chair"   "five,fingers"            
#[5] "five,fingers,short"
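Because unique() is applied across both queries, the fifth element comes out as five,fingers,short rather than in the order typed. A possible variation, sketched under the assumption that only the last q= value matters (as the question states), keeps just the final match and returns NA when there is none. The two sample URLs are illustrative.

```r
# Illustrative inputs: a URL with two queries and a non-Google URL
x <- c("https://www.google.nl/search?q=five+fingers&ie=utf-8#safe=off&q=five+short+fingers+",
       "https://www.youtube.com/watch?v=6HOallAdtDI")

matches <- regmatches(x, gregexpr("(?<=q=)[^&]+", x, perl = TRUE))

last_words <- vapply(matches, function(y) {
  if (!length(y)) return(NA_character_)        # no q= parameter at all
  # Keep only the last query; strsplit drops the empty piece after a trailing '+'
  paste(strsplit(tail(y, 1), "\\+")[[1]], collapse = ",")
}, character(1))
last_words  # "five,short,fingers" NA
```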