Question

How do I get the full revision history list of a Wikipedia page? (I don't want to scrape.)

import wapiti
import pdb
import pylab as plt  
client = wapiti.WapitiClient('mahmoudrhashemi@gmail.com')
get_revs = client.get_page_revision_infos( 'Coffee', 1000000)
print(len(get_revs))

500

Package link: https://github.com/mahmoud/wapiti

Answer 1

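If you need more than 500 revision entries, you have to use the MediaWiki API with action query, property revisions, and the parameter rvcontinue, whose value is taken from the previous response. This means you cannot get the whole list with a single request: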

https://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=Coffee&rvcontinue=...

To get more specific information of your choice, you also have to use the rvprop parameter:

&rvprop=ids|flags|timestamp|user|userid|size|sha1|contentmodel|comment|parsedcomment|content|tags|parsetree|flagged

A summary of all available parameters can be found in the MediaWiki API documentation for prop=revisions.
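As a rough illustration of how these parameters combine, the sketch below assembles a query URL with the standard library. The helper name `build_revisions_url` and the default choice of `rvprop` values are assumptions made for this example, not part of the API.

```python
from urllib.parse import urlencode

API_URL = "https://en.wikipedia.org/w/api.php"

def build_revisions_url(title, props=("ids", "timestamp", "user", "comment"),
                        limit=500, rvcontinue=None):
    """Return a prop=revisions query URL for one page title (hypothetical helper)."""
    params = {
        "action": "query",
        "format": "xml",
        "prop": "revisions",
        "titles": title,
        "rvlimit": limit,
        "rvprop": "|".join(props),  # the API expects pipe-separated values
    }
    if rvcontinue is not None:
        # Continuation value copied from the previous response
        params["rvcontinue"] = rvcontinue
    return API_URL + "?" + urlencode(params)

print(build_revisions_url("Coffee"))
```

`urlencode` percent-encodes the pipe characters and the page title, so titles with spaces or non-ASCII characters do not need special handling.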

Here is how to get the full revision history of a Wikipedia page in C#:

private static List<XElement> GetRevisions(string pageTitle)
{
    var url = "https://en.wikipedia.org/w/api.php?action=query&format=xml&prop=revisions&rvlimit=500&titles=" + pageTitle;
    var revisions = new List<XElement>();
    var next = string.Empty;
    while (true)
    {
        using (var webResponse = (HttpWebResponse)WebRequest.Create(url + next).GetResponse())
        {
            using (var reader = new StreamReader(webResponse.GetResponseStream()))
            {
                var xElement = XElement.Parse(reader.ReadToEnd());
                revisions.AddRange(xElement.Descendants("rev"));

                var cont = xElement.Element("continue");
                if (cont == null) break;

                next = "&rvcontinue=" + cont.Attribute("rvcontinue").Value;
            }
        }
    }

    return revisions;
}

At the moment, "Coffee" returns 10,414 revisions.

Edit: Here is a Python version:

import re
import urllib.parse
import urllib.request

def GetRevisions(pageTitle):
    url = "https://en.wikipedia.org/w/api.php?action=query&format=xml&prop=revisions&rvlimit=500&titles=" + urllib.parse.quote(pageTitle)
    revisions = []                                        # list of all accumulated revisions
    next = ''                                             # continuation parameter for the next request
    while True:
        response = urllib.request.urlopen(url + next).read().decode('utf-8')  # web request
        revisions += re.findall('<rev [^>]*>', response)  # adds all revisions from the current request to the list

        cont = re.search('<continue rvcontinue="([^"]+)"', response)
        if not cont:                                      # break the loop if the 'continue' element is missing
            break

        # continuation value (timestamp|revision id) from which to start the next request
        next = "&rvcontinue=" + urllib.parse.quote(cont.group(1))

    return revisions

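As you can see, the logic is exactly the same. The difference from C# is that in C# I parse the XML response, whereas here I use a regex to match all rev and continue elements in it.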

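So the idea is that I make a main request and collect all of its revisions (the maximum is 500) into the revisions array. I also check the continue XML element to find out whether there are more revisions, take the value of its rvcontinue attribute, and use it in the next variable (for this example, the value from the first request is 20150127211200|644458070) to make another request for the next 500 revisions. I repeat this as long as the continue element is present. If it is missing, no revisions remain after the last one in the response's revision list, so I exit the loop.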
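The same continuation loop can also be sketched against the JSON output format, which avoids matching XML with regular expressions. This is only a sketch under a few assumptions: the names `fetch_json` and `iter_revisions` are made up for this example, the `fetch` parameter exists solely so the loop can be exercised without network access, and the User-Agent string is a placeholder you should replace with your own contact details.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://en.wikipedia.org/w/api.php"

def fetch_json(params):
    """Perform one API request and decode the JSON body."""
    request = Request(
        API_URL + "?" + urlencode(params),
        # Placeholder User-Agent: replace with your own tool name and contact
        headers={"User-Agent": "revision-history-example/0.1 (you@example.org)"},
    )
    with urlopen(request) as response:
        return json.load(response)

def iter_revisions(title, fetch=fetch_json):
    """Yield every revision of `title`, following `continue` until it disappears."""
    params = {"action": "query", "format": "json", "prop": "revisions",
              "titles": title, "rvlimit": 500, "rvprop": "ids|timestamp|user"}
    while True:
        data = fetch(params)
        for page in data.get("query", {}).get("pages", {}).values():
            yield from page.get("revisions", [])
        if "continue" not in data:
            break  # no 'continue' element: the last batch has been read
        # Merge the continuation values (including rvcontinue) into the next request
        params = {**params, **data["continue"]}

# Usage (needs network access):
# total = sum(1 for _ in iter_revisions("Coffee"))
```

Because the function is a generator, revisions can be processed as they arrive instead of being held in one large list.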

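The API returns revisions in reverse chronological order, so the most recent revisions of the "Coffee" article come first in the list. Don't forget that if you need more specific revision information, you can use the rvprop parameter in your request.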

revisions = GetRevisions("Coffee")

print(len(revisions))
#10418
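Each entry in the list is still a raw `<rev ...>` tag string. One possible way to turn such a string into a dictionary of attributes is sketched below; `parse_rev` is a hypothetical helper, and the sample tag uses made-up values purely for illustration.

```python
import re
from html import unescape

# Matches name="value" pairs inside a single XML tag
ATTR = re.compile(r'([\w:-]+)="([^"]*)"')

def parse_rev(tag):
    """Return the attributes of one <rev ...> tag as a dict (hypothetical helper)."""
    # unescape() turns XML entities such as &quot; and &amp; back into characters
    return {name: unescape(value) for name, value in ATTR.findall(tag)}

# Made-up sample tag, not real API output
sample = ('<rev revid="644458070" parentid="644000000" user="ExampleUser" '
          'timestamp="2015-01-27T21:12:00Z" comment="fix &quot;typo&quot; &amp; grammar" />')
info = parse_rev(sample)
print(info["revid"], info["user"], info["comment"])
```

For anything beyond quick inspection, a real XML or JSON parser is the safer choice, since regular expressions only cope with the simple attribute-only tags returned here.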

Answer 2

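If you use pywikibot, you can get a generator that runs through the full revision history for you. For example, to get a generator that steps through all revisions (including their content) of the page "pagename" on the English Wikipedia, use: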

import pywikibot

site = pywikibot.Site("en", "wikipedia")
page = pywikibot.Page(site, "pagename")
revs = page.revisions(content=True)  # generator over the page's revisions

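You can pass more parameters to the query; they are described in the pywikibot API documentation.

Notably:

revisions(reverse=False, total=None, content=False, rollback=False, starttime=None, endtime=None)

Generator which loads the version history as Revision instances.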
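As a small sketch of how that generator might be consumed, the function below collects a few fields from the first revisions it yields. The name `summarize_revisions` is made up for this example, and it assumes each Revision instance exposes `revid`, `timestamp`, and `user` attributes. The function only relies on the `revisions(total=...)` method, so it accepts any object that provides it.

```python
def summarize_revisions(page, limit=10):
    """Return (revid, timestamp, user) tuples for up to `limit` revisions of `page`."""
    return [(rev.revid, str(rev.timestamp), rev.user)
            for rev in page.revisions(total=limit)]

# Usage (needs pywikibot installed and network access):
# import pywikibot
# page = pywikibot.Page(pywikibot.Site("en", "wikipedia"), "Coffee")
# for row in summarize_revisions(page):
#     print(row)
```

Passing `total` keeps the request small while experimenting; leaving it as `None` walks the whole history.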

pywikibot seems to be the approach many Wikipedia editors take to automate editing.
