Question

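I want to grab the home page of one of the new stackexchange sites: https://webapps.stackexchange.com/ (just once, and only a few pages, nothing that should bother the servers). If I wanted this from stackoverflow, I know there is a database dump, but for the new stackexchange sites no dump exists yet.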

Here is what I would like to do.

Step 1: choose the URL

URL <- "https://webapps.stackexchange.com/"

Step 2: read the table

readHTMLTable(URL)  # oops, doesn't work - gives NULL

Step 3: this time, let's try parsing the page as XML

htmlTreeParse(URL) # o.k, this reads the data - but it is all in <div> - now what?

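So I am able to read the page, but its structure is made of divs rather than tables. How can I use the parsed result to build the kind of output readHTMLTable would give?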
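
To make the goal concrete, here is a minimal sketch of turning div-based markup into a data frame with XPath queries from the XML package. The HTML string and the class names in it (question-summary, votes, question-hyperlink) are assumptions made up for illustration; the real page markup would have to be inspected and the queries adjusted accordingly.

```r
library(XML)

# Hypothetical markup: these class names are assumptions for illustration,
# not the actual structure of the Stack Exchange home page.
html <- '
<html><body>
  <div class="question-summary">
    <div class="votes">3</div>
    <a class="question-hyperlink" href="/questions/1">First question</a>
  </div>
  <div class="question-summary">
    <div class="votes">7</div>
    <a class="question-hyperlink" href="/questions/2">Second question</a>
  </div>
</body></html>'

# Parse the string into an internal document that supports XPath
doc <- htmlParse(html, asText = TRUE)

# One XPath query per column; each call returns a character vector
row_path <- "//div[@class='question-summary']"
votes  <- xpathSApply(doc, paste0(row_path, "/div[@class='votes']"), xmlValue)
titles <- xpathSApply(doc, paste0(row_path, "/a[@class='question-hyperlink']"), xmlValue)
links  <- xpathSApply(doc, paste0(row_path, "/a[@class='question-hyperlink']"),
                      xmlGetAttr, "href")

# Assemble the columns into a table-like structure
questions <- data.frame(votes = as.integer(votes), title = titles, link = links,
                        stringsAsFactors = FALSE)
print(questions)
```

This column-by-column approach assumes every row contains every field; if some rows lack one, the vectors will not line up and it is safer to loop over the row nodes instead.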

Answer 1

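You can do this with the overflowr package (which uses the StackExchange API). Just call the get.questions() function and supply the site prefix. It is not on CRAN because it is incomplete, but you can download and build it.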

library(overflowr)
questions <- get.questions(50)

For the stats site, the 5 most recent questions:

questions <- get.questions(top.n=5, site="stats.stackexchange")
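
For readers who cannot build overflowr, the following is a rough sketch of the general idea behind an API-based approach: the API returns JSON, which converts to a data frame directly. This is not overflowr's code. It assumes the jsonlite package, and the sample string is hand-written to resemble a questions response (an items array with title, score, answer_count and link fields); a live request may return more fields or a different layout.

```r
library(jsonlite)

# Hand-written sample, assumed to resemble an API "questions" response;
# it is not real data from the service.
json <- '{
  "items": [
    {"title": "Question one", "score": 4, "answer_count": 2,
     "link": "https://stats.stackexchange.com/questions/1"},
    {"title": "Question two", "score": 1, "answer_count": 0,
     "link": "https://stats.stackexchange.com/questions/2"}
  ]
}'

# fromJSON simplifies an array of uniform objects into a data frame
resp <- fromJSON(json)

# Keep only the columns of interest
questions <- resp$items[, c("title", "score", "answer_count", "link")]
print(questions)
```

Because the result is already tabular, no HTML parsing is needed at all.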

By the way, I would be glad to have more people join this project, since I don't have much time to spend on it. Three of the moderators from Stats.Exchange are currently working on it.

Answer 2

What are you writing this in? I wrote an application that parses web scrapes (link). I would be more than happy to share the logic.

How to scrape "table-like" data from the stackexchange home page? (in R)