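I'm trying to scrape Marvel movies and their characters (featured, supporting, antagonists, etc.) from marvel.wikia.com. The characters appear as lists in the DOM, and I can't work out the right html_nodes() selector to get all the list items under each character type.
The code below extracts every link inside a list item, whereas I only want the links that belong to Featured Characters, Supporting Characters, Antagonists and Other Characters (the last category does not apply to X2).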
library(rvest)
library(tidyverse)
test_url <- "http://marvel.wikia.com/wiki/X2_(film)"
read_html(test_url) %>%
  html_nodes("li > a") %>%
  html_text()
Desired result:
# A tibble: 16 x 3
   movie type                  character
   <chr> <chr>                 <chr>
 1 X2    Featured Characters   Professor Charles Xavier
 2 X2    Featured Characters   Wolverine (Logan)
 3 X2    Featured Characters   Storm (Ororo Munroe)
 4 X2    Featured Characters   Dr. Jean Grey
 5 X2    Featured Characters   Cyclops (Scott Summers)
 6 X2    Featured Characters   Rogue (Marie)
 7 X2    Featured Characters   Iceman (Bobby Drake)
 8 X2    Supporting Characters Nightcrawler (Kurt Wagner)
 9 X2    Supporting Characters Pyro (John Allerdyce)
10 X2    Supporting Characters Mystique (Raven Darkholme)
11 X2    Supporting Characters Magneto (Erik Lehnsherr)
12 X2    Antagonists           Col. William Stryker
13 X2    Antagonists           Sgt. Lyman
14 X2    Antagonists           Unnamed Soldiers
15 X2    Antagonists           Deathstrike (Yuriko Oyama)
16 X2    Antagonists           Mutant 143 (Jason Stryker)
Answer 0 (score: 2)
You can start with something like this:
library(rvest)
library(tidyverse)
test_url <- "http://marvel.wikia.com/wiki/X2_(film)"
# scrape the text of each top-level list in the content area
url_data <- read_html(test_url) %>%
  html_nodes(xpath = '//*[@id="mw-content-text"]/ul') %>%
  html_text()
# format the scraped data: one row per line of text in each list
df <- data.frame(movie = gsub(".*/", "", test_url),
                 type = c("Featured Characters", "Supporting_Characters", "Antagonists", "Other_Characters"),
                 characters = url_data[1:4]) %>%
  separate_rows(characters, sep = "\\n")
which gives
> head(df)
      movie                type                     characters
1 X2_(film) Featured Characters                          X-Men
2 X2_(film) Featured Characters       Professor Charles Xavier
3 X2_(film) Featured Characters              Wolverine (Logan)
4 X2_(film) Featured Characters           Storm (Ororo Munroe)
5 X2_(film) Featured Characters Dr. Jean Grey (Apparent death)
6 X2_(film) Featured Characters        Cyclops (Scott Summers)
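The output above still contains group headers such as X-Men and labels the movie X2_(film), while the type labels are typed in by hand. A possible refinement, shown here only as a sketch, is to read each label from the element just before its list and to keep only leaf list items. The HTML fragment below is hypothetical and merely stands in for the wiki page; the real markup (in particular, whether a <p> precedes each list) is an assumption and should be checked in the browser inspector.

```r
library(rvest)

# Hypothetical fragment standing in for the wiki page; the real markup may differ.
page <- read_html('
<div id="mw-content-text">
  <p><b>Featured Characters:</b></p>
  <ul>
    <li>X-Men
      <ul>
        <li>Professor Charles Xavier</li>
        <li>Wolverine (Logan)</li>
      </ul>
    </li>
  </ul>
  <p><b>Antagonists:</b></p>
  <ul>
    <li>Col. William Stryker</li>
  </ul>
</div>')

# Derive "X2" from the URL by dropping the trailing "_(film)" suffix.
test_url <- "http://marvel.wikia.com/wiki/X2_(film)"
movie <- sub("_\\(film\\)$", "", basename(test_url))

# Top-level lists only; nested <ul> elements are children of <li>, not of the div.
lists <- html_nodes(page, xpath = '//*[@id="mw-content-text"]/ul')

rows <- lapply(lists, function(ul) {
  # Assumption: the nearest preceding <p> sibling holds the section label.
  label <- html_text(html_node(ul, xpath = "preceding-sibling::p[1]"), trim = TRUE)
  # Leaf items only: <li> elements that contain no nested <ul>,
  # which skips group headers such as "X-Men".
  leaves <- html_text(html_nodes(ul, xpath = ".//li[not(.//ul)]"), trim = TRUE)
  data.frame(movie = movie,
             type = sub(":$", "", label),
             character = leaves,
             stringsAsFactors = FALSE)
})

characters <- do.call(rbind, rows)
print(characters)
```

Annotations such as "(Apparent death)" are not handled here, because a pattern that strips them would also strip real names like "(Logan)"; those would need a rule specific to the page.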