Removing span tags in R; data appears twice

Time: 2016-11-26 22:22:35

Tags: html r web-scraping rvest

I am trying to extract data from an HTML page in R using the following code:

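library(rvest)  # provides read_html(), html_nodes(), html_table() and %>%

wiki_url_html <- read_html("https://en.wikipedia.org/wiki/List_of_Major_League_Baseball_players_suspended_for_performance-enhancing_drugs")

# Select the table matched by the XPath //table[3] and convert it to a data frame
bb_player_PED <- (wiki_url_html %>%
  html_nodes(xpath = '//table[3]') %>%
  html_table())[[1]]

head(bb_player_PED, 10)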

This gives me the following data:

Player                 Team                           Date announced       Drug      Penalty Position
1              Sánchez, AlexAlex Sánchez Tampa Bay Devil Rays     000000002005-04-03-0000April 3, 2005            010 !10 days       OF
2        Montero, AgustínAgustín Montero        Texas Rangers    000000002005-04-20-0000April 20, 2005            010 !10 days        P
3              Strong, JamalJamal Strong     Seattle Mariners    000000002005-04-26-0000April 26, 2005            010 !10 days       OF
4                Rincón, JuanJuan Rincón      Minnesota Twins       000000002005-05-02-0000May 2, 2005            010 !10 days        P
5    Betancourt, RafaelRafael Betancourt    Cleveland Indians      000000002005-07-08-0000July 8, 2005            010 !10 days        P
6  Palmeiro, RafaelRafael Palmeiro SS GG    Baltimore Orioles    000000002005-08-01-0000August 1, 2005 Stanozolol 010 !10 days       DH
7            Franklin, RyanRyan Franklin     Seattle Mariners    000000002005-08-02-0000August 2, 2005            010 !10 days        P
8                  Morse, MikeMike Morse     Seattle Mariners 000000002005-09-07-0000September 7, 2005            010 !10 days       SS
9        Almanzar, CarlosCarlos Almanzar        Texas Rangers   000000002005-10-04-0000October 4, 2005            010 !10 days        P
10           Heredia, FélixFélix Heredia        New York Mets  000000002005-10-18-0000October 18, 2005            010 !10 days        P
Response Ref.
1       [a]  [5]
2       [b]  [7]
3       [c]  [9]
4       [d]  [9]
5       [e] [12]
6       [f] [14]
7       [g] [16]
8       [h] [18]
9       [i] [20]
10      [j] [22]

My question is: how can I remove the span tags from the data so that values in columns such as Player, Date announced and Penalty do not appear twice?

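As I understand it, this happens because the table cells contain span tags, and the code above concatenates the text of every span in a cell (for example a sort key followed by the displayed value).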
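To make that idea concrete, here is a minimal sketch of what dropping the extra spans before calling html_table() could look like. It runs on a small hand-written HTML fragment rather than the live page, and it assumes the duplicated text sits in spans with class "sortkey"; that class name and the sample rows are assumptions for illustration, so the real page's markup would need to be checked first.

```r
library(rvest)
library(xml2)

# Hypothetical fragment imitating a sortable table: each cell holds a
# sort-key span followed by the text that is actually displayed.
html <- '<table>
<tr><th>Player</th><th>Date announced</th><th>Penalty</th></tr>
<tr><td><span class="sortkey">Strong, Jamal</span><span class="vcard">Jamal Strong</span></td><td><span class="sortkey">000000002005-04-26-0000</span><span>April 26, 2005</span></td><td><span class="sortkey">010 !</span>10 days</td></tr>
<tr><td><span class="sortkey">Rincon, Juan</span><span class="vcard">Juan Rincon</span></td><td><span class="sortkey">000000002005-05-02-0000</span><span>May 2, 2005</span></td><td><span class="sortkey">010 !</span>10 days</td></tr>
</table>'

page <- read_html(html)

# Find the sort-key spans (assumed class name) and remove them from the
# parsed document; xml_remove() modifies `page` in place.
sort_keys <- xml_find_all(page, "//span[contains(@class, 'sortkey')]")
xml_remove(sort_keys)

# Convert the cleaned table as before.
tbl <- html_table(html_nodes(page, "table"))[[1]]
print(tbl)
```

If the real page marks the hidden spans differently (for instance only with an inline `display:none` style), the XPath would have to be adapted accordingly.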

I tried the following:

removeNodes(getNodeSet(xmlTreeParse(wiki_url_html, useInternalNodes = T), "//table/tr/th/i/span"))
as in this Stack Overflow post:

Scraping a complex HTML table into a data.frame in R

to remove the span tags, but it only returns NULL.
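For reference, here is a small sketch of how the XML-package route seems intended to be used, again on a hand-written fragment and with an assumed "sortkey" class. It parses the HTML with htmlParse() from the XML package (not with rvest's read_html()), keeps the parsed document in a variable, and then inspects that document after removeNodes() has run. As far as I can tell, removeNodes() works by modifying the parsed document in place, so its return value is not the cleaned table.

```r
library(XML)

# Hypothetical one-cell table; the "sortkey" class is an assumption.
html <- '<table><tr><th>Player</th></tr><tr><td><span class="sortkey">Strong, Jamal</span><span class="vcard">Jamal Strong</span></td></tr></table>'

# Parse with the XML package and keep a handle on the document.
doc <- htmlParse(html, asText = TRUE)

# Cell text before removal: sort key and display text run together.
before <- xpathSApply(doc, "//td", xmlValue)

# Remove the matching spans; the result of interest is the changed `doc`,
# not whatever removeNodes() itself returns.
invisible(removeNodes(getNodeSet(doc, "//span[@class='sortkey']")))

# Cell text after removal.
after <- xpathSApply(doc, "//td", xmlValue)

print(before)
print(after)
```

In my attempt above, the parsed tree is never stored and the XPath targets th/i/span rather than the spans inside the data cells, which may be part of the problem.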

Any help would be much appreciated, thanks.

0 answers:

No answers