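I want to use rvest in R to get the categories (listed at the bottom of the Wikipedia page). I used SelectorGadget to identify the HTML node for extracting the categories. The code I used is as follows: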
library(rvest)

thepage <- read_html("https://en.wikipedia.org/wiki/San_Diego")
Categories <- thepage %>%
  html_nodes("#mw-normal-catlinks") %>%
  html_text()
Categories
The result is:
"Categories: San Diego1769 establishments in California1850 establishments in CaliforniaCities in San Diego County, CaliforniaCounty seats in CaliforniaIncorporated cities and towns in CaliforniaPopulated coastal places in CaliforniaPopulated places established in 1769San Antonio-San Diego Mail LineSan Diego County, CaliforniaSan Diego metropolitan areaSpanish mission settlements in North AmericaSpecial economic zones of the United StatesStagecoach stops in the United States"
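As you can see, there is no delimiter separating the categories. The first category is "San Diego" and the second is "1769 establishments in California". How can I get these categories as a list, or separate them in some way?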
Answer 0 (score: 1)
Each category is a list item, so you need to select the items inside the list rather than the container that holds them:
thepage %>%
  html_nodes(".mw-normal-catlinks ul li") %>%
  html_text()
[1] "San Diego" "1769 establishments in California"
[3] "1850 establishments in California" "Cities in San Diego County, California"
[5] "County seats in California" "Incorporated cities and towns in California"
[7] "Populated coastal places in California" "Populated places established in 1769"
[9] "San Antonio-San Diego Mail Line" "San Diego County, California"
[11] "San Diego metropolitan area" "Spanish mission settlements in North America"
[13] "Special economic zones of the United States" "Stagecoach stops in the United States"
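To see why the selector matters without hitting the network, here is a minimal sketch that runs the same two selectors against an inline HTML fragment. The fragment is a hypothetical, simplified stand-in for Wikipedia's category box (the real markup has more attributes and links), and the names `html`, `page`, `whole`, and `cats` are made up for illustration.

```r
library(rvest)

# Hypothetical, simplified stand-in for the category box on the page.
html <- '<div id="mw-normal-catlinks" class="mw-normal-catlinks">
  <a href="#">Categories</a>:
  <ul>
    <li><a href="#">San Diego</a></li>
    <li><a href="#">1769 establishments in California</a></li>
  </ul>
</div>'

page <- read_html(html)

# Selecting the container matches a single node, so html_text()
# returns one string with all of the descendant text run together.
whole <- page %>%
  html_nodes("#mw-normal-catlinks") %>%
  html_text()
length(whole)

# Selecting each <li> matches one node per category, so html_text()
# returns a character vector with one element per category.
cats <- page %>%
  html_nodes("#mw-normal-catlinks ul li") %>%
  html_text()
cats
```

Because `html_text()` returns one string per matched node, the number of nodes the selector matches determines how the text is split. If an actual R list is needed rather than a character vector, `as.list(cats)` converts it.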