rvest: Retrieving bold links from a list (<li> <a> <b>), following each link and saving the date (#infobox_patch b)

Date: 2017-08-31 14:19:13

Tags: r rvest

I'm trying to retrieve a list of release dates for Counter-Strike: Global Offensive's major updates for a data scraping assignment. The major updates are shown in bold in a list on a wikia page. The problem is that most of the major update links use <a><b> (b is a child of a), so I can't retrieve the entire set of links. The rest of the code works as intended; only the two selectors at the top need to be adjusted.

The script uses an html_session(). It finds suitable links to follow (provided by the selectors) and extracts the dates with the for loop at the bottom of the script. I tried porting hrbrmstr's code into the script, but I got NULL for the csgo.patches.date vector.

It's worth noting that 3 of the major updates use <b><a> instead of <a><b>, which is why they show up when you run the scraper iteration at the bottom of the code (there should be 42 major updates as of 01/09/2017).

```{r scraping setup, echo=TRUE}
url.patches <- "http://counterstrike.wikia.com/wiki/Counter-Strike:_Global_Offensive_patches"

## Finds a section of the document (currently matches any b nested inside an li)
selector.patches <- "#mw-content-text li b"

## Locates the link to the next page (Stores a date with years)
selector.date <- "a"
```

```{r session, echo=TRUE}
doc.patches <- html_session(url.patches)
```

```{r fetch jobs, echo=TRUE}
csgo.patches <- html_nodes(doc.patches, selector.patches)
cat("Fetched", length(csgo.patches), "results\n")
csgo.patches
```

```{r fetch urls, echo=TRUE}
links.patches <- html_nodes(csgo.patches, selector.date)
href.patches <- html_attr(links.patches, "href")
```

```{r scraper iteration, echo=TRUE}
selector.patchdate <- "#infobox_patch b"

csgo.patches.date <- NULL ## a container for our results, starts off empty
for (csgo.patch in href.patches) {
  csgo.patch.loc <- tryCatch({
    csgo.patch.doc <- jump_to(doc.patches, csgo.patch)
    csgo.patch.loc <- html_node(csgo.patch.doc, selector.patchdate)

    html_text(csgo.patch.loc)
  }, error=function(e) NULL)

  ## add the next location to our results vector
  csgo.patches.date <- c(csgo.patches.date, csgo.patch.loc)
}
csgo.patches.date
```

I appreciate the help, thank you!

1 answer:

Answer 0 (score: 1)

```r
library(rvest)

pg <- read_html("http://counterstrike.wikia.com/wiki/Counter-Strike:_Global_Offensive_patches")

html_nodes(pg, xpath = ".//li/a/b/.. | .//li/b/a")
## {xml_nodeset (42)}
##  [1] <a href="/wiki/Counter-Strike:_Global_Offensive_patches/January_12, ...
##  [2] <a href="/wiki/Counter-Strike:_Global_Offensive_patches/March_15,_2 ...
##  [3] <a href="/wiki/Counter-Strike:_Global_Offensive_patches/May_23,_201 ...
##  [4] <a href="/wiki/Counter-Strike:_Global_Offensive_patches/July_7,_201 ...
##  [5] <a href="/wiki/Counter-Strike:_Global_Offensive_patches/February_17 ...
##  [6] <a href="/wiki/Counter-Strike:_Global_Offensive_patches/March_17,_2 ...
##  [7] <a href="/wiki/Counter-Strike:_Global_Offensive_patches/April_27,_2 ...
##  [8] <a href="/wiki/Counter-Strike:_Global_Offensive_patches/June_15,_20 ...
##  [9] <a href="/wiki/Counter-Strike:_Global_Offensive_patches/August_18,_ ...
## [10] <a href="/wiki/Counter-Strike:_Global_Offensive_patches/October_6,_ ...
## [11] <a href="/wiki/Counter-Strike:_Global_Offensive_patches/October_13, ...
## [12] <a href="/wiki/Counter-Strike:_Global_Offensive_patches/November_28 ...
## [13] <a href="/wiki/Counter-Strike:_Global_Offensive_patches/December_13 ...
## [14] <a href="/wiki/Counter-Strike:_Global_Offensive_patches/January_8,_ ...
## [15] <a href="/wiki/Counter-Strike:_Global_Offensive_patches/February_26 ...
## [16] <a href="/wiki/Counter-Strike:_Global_Offensive_patches/March_31,_2 ...
## [17] <a href="/wiki/Counter-Strike:_Global_Offensive_patches/April_15,_2 ...
## [18] <a href="/wiki/Counter-Strike:_Global_Offensive_patches/May_26,_201 ...
## [19] <a href="/wiki/Counter-Strike:_Global_Offensive_patches/September_1 ...
## [20] <a href="/wiki/Counter-Strike:_Global_Offensive_patches/October_20, ...
## ...
```
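
To see why the XPath union picks up both markup variants while the question's two CSS selectors do not, here is a minimal offline sketch. The HTML fragment, its hrefs, and the variable names are made up for illustration and are not taken from the wikia page; only the selectors and the XPath expression come from the thread.

```r
library(rvest)

# Hypothetical miniature of the patch list: two major updates (bold) in the
# two markup variants, plus one minor update that is not bold.
html <- '
<div id="mw-content-text"><ul>
  <li><a href="/wiki/patches/A"><b>Major A</b></a></li>
  <li><b><a href="/wiki/patches/B">Major B</a></b></li>
  <li><a href="/wiki/patches/C">Minor C</a></li>
</ul></div>'

pg <- read_html(html)

# The question's approach: select <b> inside <li>, then look for <a> inside
# each <b>. This can only find links written as <b><a>.
bold.nodes <- html_nodes(pg, "#mw-content-text li b")
hrefs.css  <- html_attr(html_nodes(bold.nodes, "a"), "href")

# The answer's approach: "li/a/b/.." steps from <b> back up to its parent <a>,
# and "li/b/a" selects the <a> nested in <b>; "|" takes the union of both.
links.xpath <- html_nodes(pg, xpath = ".//li/a/b/.. | .//li/b/a")
hrefs.xpath <- html_attr(links.xpath, "href")

hrefs.css
hrefs.xpath
```

On this fragment the CSS route should return only the <b><a> link, while the XPath route should return both bold links and skip the minor one. Against the live page, the same `xpath` argument can presumably be passed to html_nodes() on the html_session() object from the question rather than on a read_html() document. jump_to() expects a session, so calling it on a plain document likely raises an error that the tryCatch() turns into NULL on every iteration, which would explain the NULL result vector.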