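Is there an R package that can query Wikipedia (most likely through the MediaWiki API) to get a list of available articles relevant to a query, and then import selected articles for text mining?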
Answer 0 (score: 11)
There is WikipediR, an R wrapper for the MediaWiki API:
library(devtools)
install_github("Ironholds/WikipediR")
library(WikipediR)
It includes the following functions:
ls("package:WikipediR")
[1] "wiki_catpages" "wiki_con" "wiki_diff" "wiki_page"
[5] "wiki_pagecats" "wiki_recentchanges" "wiki_revision" "wiki_timestamp"
[9] "wiki_usercontribs" "wiki_userinfo"
Here it is in use, fetching contribution details and user details for a set of users:
library(RCurl)
library(XML)
# scrape page to get usernames of users with highest numbers of edits
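top_editors_page <- "https://en.wikipedia.org/wiki/Wikipedia:List_of_Wikipedians_by_number_of_edits"
# download over HTTPS with RCurl first; readHTMLTable() cannot fetch https URLs itself
top_editors_table <- readHTMLTable(htmlParse(getURL(top_editors_page), asText = TRUE))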
very_top_editors <- as.character(top_editors_table[[3]][1:5,]$User)
# setup connection to wikimedia project
con <- wiki_con("en", project = c("wikipedia"))
# connect to API and get last 50 edits per user
user_data <- lapply(very_top_editors, function(i) wiki_usercontribs(con, i) )
# and get information about the users (registration date, gender, editcount, etc)
user_info <- lapply(very_top_editors, function(i) wiki_userinfo(con, i) )
Answer 1 (score: 6)
Use the RCurl package to retrieve the details, and the XML or RJSONIO package to parse the response.
If you are behind a proxy server, set the connection options first.
opts <- list(
proxy = "136.233.91.120",
proxyusername = "mydomain\\myusername",
proxypassword = 'whatever',
proxyport = 8080
)
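If the same script also has to run on machines without a proxy, the options list could be built conditionally instead of hard-coded. The following is only a sketch: `make_proxy_opts` and the environment variable name `WIKI_PROXY_HOST` are hypothetical, not part of RCurl. The option names are the ones used above.

# Hypothetical helper: returns an empty list when no proxy host is given,
# so the result can always be passed as .opts.
make_proxy_opts <- function(host = "", port = 8080, user = "", password = "") {
  if (!nzchar(host)) {
    return(list())
  }
  opts <- list(proxy = host, proxyport = port)
  # only add credentials when a user name was supplied
  if (nzchar(user)) {
    opts$proxyusername <- user
    opts$proxypassword <- password
  }
  opts
}

# WIKI_PROXY_HOST is an assumed environment variable name
opts <- make_proxy_opts(host = Sys.getenv("WIKI_PROXY_HOST"))

With no proxy configured, opts is an empty list and the request goes out directly.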
Use the getForm function to access the API.
search_example <- getForm(
"https://en.wikipedia.org/w/api.php",
action = "opensearch",
search = "Te",
format = "json",
.opts = opts
)
Parse the result.
fromJSON(rawToChar(search_example))
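The parsed opensearch reply is a nested list, which is awkward for picking articles. As a rough sketch, it could be flattened into a data frame of titles and URLs. The helper name `opensearch_to_df` is hypothetical, and the code assumes the reply has the four-element opensearch layout (query, titles, descriptions, URLs). The example input is built by hand to stand in for the parsed reply.

# Hypothetical helper: turn a parsed opensearch reply into a data frame.
# Assumes the four-element layout [query, titles, descriptions, urls].
opensearch_to_df <- function(parsed) {
  titles <- as.character(unlist(parsed[[2]]))
  urls <- as.character(unlist(parsed[[4]]))
  data.frame(title = titles, url = urls, stringsAsFactors = FALSE)
}

# Hand-built stand-in for fromJSON(rawToChar(search_example))
example_reply <- list(
  "Te",
  c("Te", "Tea"),
  c("", ""),
  c("https://en.wikipedia.org/wiki/Te", "https://en.wikipedia.org/wiki/Tea")
)
results <- opensearch_to_df(example_reply)
print(results$title)

The title column then gives the list of candidate articles to choose from.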
Answer 2 (score: 0)
The wikifacts package (on CRAN) is a newer option:
library(wikifacts)
wiki_define('R (programming language)')
## R (programming language)
## "R is a programming language and free software environment for statistical computing and graphics supported by the R Foundation for Statistical Computing. The R language is widely used among statisticians and data miners for developing statistical software and data analysis. Polls, data mining surveys, and studies of scholarly literature databases show substantial increases in popularity; as of April 2021, R ranks 16th in the TIOBE index, a measure of popularity of programming languages.The official R software environment is a GNU package.\nIt is written primarily in C, Fortran, and R itself (thus, it is partially self-hosting) and is freely available under the GNU General Public License. Pre-compiled executables are provided for various operating systems."
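For text mining, several articles are usually needed at once. One possible approach, sketched here, is to loop a single-article fetcher over a vector of titles and keep going when a lookup fails. `fetch_many` and `fake_fetch` are hypothetical names. The stand-in fetcher only keeps the sketch runnable offline; with wikifacts installed, wiki_define could be passed in its place.

# Hypothetical helper: apply a one-article fetcher to many titles.
# Failed lookups become NA instead of stopping the loop.
fetch_many <- function(titles, fetch_one) {
  texts <- vapply(titles, function(title) {
    tryCatch(
      as.character(fetch_one(title))[1],
      error = function(e) NA_character_
    )
  }, character(1))
  names(texts) <- titles
  texts
}

# Stand-in fetcher so the sketch runs offline; with wikifacts installed,
# fetch_many(titles, wikifacts::wiki_define) would be the real call.
fake_fetch <- function(title) {
  if (title == "Missing page") stop("not found")
  paste("Text about", title)
}
corpus <- fetch_many(c("R (programming language)", "Missing page"), fake_fetch)

The result is a named character vector, one element per title, which can be handed to a text-mining package as a corpus source.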