How can I get the "Categories" from Wikipedia using rvest in R?

Time: 2017-09-07 04:39:06

Tags: r rvest

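I want to use rvest in R to get the categories that appear on most Wikipedia pages. I used SelectorGadget to identify the HTML node for extracting the categories. The code I used is as follows: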

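library(rvest)  # provides read_html(), html_nodes(), html_text() and %>%

thepage <- read_html("https://en.wikipedia.org/wiki/San_Diego")
# Select the whole category box by its id and extract its text
Categories <- thepage %>%
  html_nodes("#mw-normal-catlinks") %>%
  html_text()
Categories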

The result I get is:

"Categories: San Diego1769 establishments in California1850 establishments in CaliforniaCities in San Diego County, CaliforniaCounty seats in CaliforniaIncorporated cities and towns in CaliforniaPopulated coastal places in CaliforniaPopulated places established in 1769San Antonio-San Diego Mail LineSan Diego County, CaliforniaSan Diego metropolitan areaSpanish mission settlements in North AmericaSpecial economic zones of the United StatesStagecoach stops in the United States"

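As you can see, there is no delimiter separating the categories. The first category is "San Diego" and the second is "1769 establishments in California". How can I get these categories as a list, or separate them in some other way?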

1 Answer:

Answer 0 (score: 1)

Each category is a list item, so you need to select the items inside the list:

thepage %>% 
  html_nodes(".mw-normal-catlinks ul li") %>% 
  html_text()

 [1] "San Diego"                                    "1769 establishments in California"           
 [3] "1850 establishments in California"            "Cities in San Diego County, California"      
 [5] "County seats in California"                   "Incorporated cities and towns in California" 
 [7] "Populated coastal places in California"       "Populated places established in 1769"        
 [9] "San Antonio-San Diego Mail Line"              "San Diego County, California"                
[11] "San Diego metropolitan area"                  "Spanish mission settlements in North America"
[13] "Special economic zones of the United States"  "Stagecoach stops in the United States"
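
To see why selecting the list items helps, here is a minimal offline sketch. The HTML snippet is hand-written and only imitates the structure of the category box (an assumption; the real page markup carries more attributes and entries), and the variable names are illustrative. It contrasts selecting the container with selecting each `li`, and also pulls the link targets with `html_attr()`.

```r
library(rvest)

# Hypothetical snippet imitating the category box; not copied from the live page.
snippet <- '<div id="mw-normal-catlinks" class="mw-normal-catlinks">
  <a href="/wiki/Help:Category">Categories</a>:
  <ul>
    <li><a href="/wiki/Category:San_Diego">San Diego</a></li>
    <li><a href="/wiki/Category:County_seats_in_California">County seats in California</a></li>
  </ul>
</div>'

page <- read_html(snippet)

# Selecting the container returns a single string with all text run together
container_text <- page %>%
  html_nodes("#mw-normal-catlinks") %>%
  html_text()

# Selecting each <li> returns one element per category
category_names <- page %>%
  html_nodes("#mw-normal-catlinks ul li") %>%
  html_text()

# The link target of each category, taken from the <a> inside each <li>
category_links <- page %>%
  html_nodes("#mw-normal-catlinks ul li a") %>%
  html_attr("href")
```

Here `container_text` has length 1, while `category_names` and `category_links` each hold one entry per category. If the page layout changes, the selectors would need to be adjusted accordingly.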