rvest: scraping Wikipedia table cells that contain links produces duplicated text

Asked: 2017-10-24 13:46:24

Tags: web-scraping wikipedia rvest

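I am using rvest to scrape an HTML table from Wikipedia. It seems that whenever a cell contains a link, after running html_table() I get the actual text followed by the name of the link in the resulting R data frame.

Here is an example.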

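library(rvest)  # also provides the %>% pipe

males_raw <- read_html("https://en.wikipedia.org/wiki/List_of_Academy_Award_Best_Actor_winners_by_age")

# Extract every table in the article body as a data frame
males <- males_raw %>%
  html_nodes(xpath = '//*[@id="mw-content-text"]/div/table') %>%
  html_table()

# Keep the first table
males <- males[[1]]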

This produces a data frame in which the actor's name is duplicated:

dplyr::glimpse(males)


Observations: 91
Variables: 9
$ `#`                         <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "1...
$ Actor                       <chr> "Jannings, EmilEmil Jannings", "Baxter, WarnerWarner Baxter...
$ Film                        <chr> "The Way of All Flesh and The Last CommandThe Last Command,...
$ `Date of birth`             <chr> "1884-07-23(1884-07-23)July 23, 1884", "1889-03-29(1889-03-...
$ `Date of award`             <chr> "May 16, 1929 (1929-05-16)", "April 3, 1930 (1930-04-03)", ...
$ `Age upon\nreceiving award` <chr> "44-2977004163670000000000♠44 years, 297 days", "41-0057004...
$ `Date of death`             <chr> "1950-01-02(1950-01-02)January 2, 1950", "1951-05-07(1951-0...
$ Lifespan                    <chr> "23,903 days (7004239030000000000♠65 years, 163 days)", "22...
$ Notes                       <chr> "Held record as oldest winner for 2 award ceremonies (from ...

I would prefer the name to appear as, for example, "Jannings, Emil" rather than "Jannings, EmilEmil Jannings".

Thanks!

1 answer:

Answer 0 (score: 1)

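The problem here is that the table cells contain span elements that are not displayed. html_table converts their contents to text and concatenates it with the visible text in the td element. The same thing happens in the other columns.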

<td>
  <span style="display:none">Jannings, Emil</span>
  <a href="/wiki/Emil_Jannings" title="Emil Jannings">Emil Jannings</a>
</td>
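
The effect can be reproduced without fetching the page. The following is a minimal sketch, assuming a hypothetical one-row table that mimics the cell structure shown above (the `html` string and the `tbl` name are made up for illustration and are not taken from the Wikipedia article):

```r
library(rvest)

# Hypothetical table: one cell with a hidden span followed by a link,
# modelled on the Wikipedia markup shown above.
html <- paste0(
  '<table>',
  '<tr><th>Actor</th></tr>',
  '<tr><td>',
  '<span style="display:none">Jannings, Emil</span>',
  '<a href="/wiki/Emil_Jannings">Emil Jannings</a>',
  '</td></tr>',
  '</table>'
)

page <- read_html(html)

# html_table() works on the text of each cell and does not evaluate CSS,
# so the text of the hidden span ends up in the result as well.
tbl <- page %>%
  html_node("table") %>%
  html_table()

print(tbl[[1]])
```

The printed value should contain both the hidden sort key and the visible link text, which is the duplication described in the question.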

One way to solve this is to remove the span nodes:

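library(xml2)   # xml_remove()
library(dplyr)  # glimpse()

# Select every span that sits directly inside a table cell
spans <- males_raw %>% 
  html_nodes(xpath = "//*/tr/td/span")

# Removes the nodes from males_raw in place
xml_remove(spans)

males_raw %>% 
  html_nodes(xpath = '//*[@id="mw-content-text"]/div/table') %>%
  html_table() %>% 
  .[[1]] %>% 
  glimpse()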

Observations: 91
Variables: 9
$ `#`                         <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13"...
$ Actor                       <chr> "Emil Jannings", "Warner Baxter", "George Arliss", "Lionel Barrymor...
$ Film                        <chr> "The Last Command,The Way of All Flesh", "In Old Arizona", "Disrael...
$ `Date of birth`             <chr> "July 23, 1884", "March 29, 1889", "April 10, 1868", "April 28, 187...
$ `Date of award`             <chr> "May 16, 1929", "April 3, 1930", "November 5, 1930", "November 10, ...
$ `Age upon\nreceiving award` <chr> "44 years, 297 days", "41 years, 5 days", "62 years, 209 days", "53...
$ `Date of death`             <chr> "January 2, 1950", "May 7, 1951", "February 5, 1946", "November 15,...
$ Lifespan                    <chr> "23,903 days (65 years, 163 days)", "22,683 days (62 years, 39 days...
$ Notes                       <chr> "Held record as oldest winner for 2 award ceremonies (from the 1st ...
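
The XPath above removes every span that is a direct child of a td, including visible ones. If a table also carries visible spans that should be kept, a narrower variant is possible. The sketch below is only an illustration under stated assumptions: `remove_hidden_spans` is a hypothetical helper, not part of rvest or xml2; it assumes the hidden spans are marked with an inline `display:none` style as in the snippet above; and the sample table is made up.

```r
library(rvest)
library(xml2)

# Hypothetical helper: remove only spans hidden through an inline
# "display:none" style. translate() strips spaces so that
# "display: none" is matched too.
remove_hidden_spans <- function(doc) {
  hidden <- xml_find_all(
    doc,
    "//td//span[contains(translate(@style, ' ', ''), 'display:none')]"
  )
  xml_remove(hidden)  # modifies doc in place
  invisible(doc)
}

# Made-up table with one hidden span and one visible span
html <- paste0(
  '<table>',
  '<tr><th>Actor</th><th>Note</th></tr>',
  '<tr>',
  '<td><span style="display: none">Jannings, Emil</span>',
  '<a href="/wiki/Emil_Jannings">Emil Jannings</a></td>',
  '<td><span>kept</span></td>',
  '</tr>',
  '</table>'
)

page <- read_html(html)
remove_hidden_spans(page)

tbl <- page %>%
  html_node("table") %>%
  html_table()

print(tbl)
```

Because xml_remove() changes the parsed document itself, re-read the page with read_html() if the original nodes are needed again later.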