Question

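I'm using pywikibot-core. Before that I used Wikipedia.py, another Python MediaWiki API wrapper, which has a .HTML method. I switched to pywikibot-core because I think it has many more features, but I can't find a similar method. (Beware: I'm not very skilled.)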

Answer 1

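I'll post user283120's second answer here, since it is more precise than the first one:

Pywikibot core doesn't support any direct (HTML) way of interacting with the wiki, so you should use the API. If you do need the HTML, you can get it easily with urllib2.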

This is an example I used to get the HTML of a wiki page on Commons:

import urllib2
...
url = "https://commons.wikimedia.org/wiki/" + page.title().replace(" ","_")
html = urllib2.urlopen(url).read().decode('utf-8')
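Note that urllib2 only exists in Python 2; in Python 3 the same calls live in urllib.request. Here is a rough Python 3 sketch of the same idea. The helper names page_url and fetch_html, the User-Agent string and the timeout are my own assumptions for illustration, not part of the original answer or of Pywikibot:

```python
from urllib.parse import quote
from urllib.request import Request, urlopen

# Base URL taken from the answer above.
BASE_URL = "https://commons.wikimedia.org/wiki/"


def page_url(title, base_url=BASE_URL):
    """Build the article URL for a page title (hypothetical helper)."""
    # Spaces become underscores, as in the original snippet;
    # everything else that is not URL-safe gets percent-encoded.
    return base_url + quote(title.replace(" ", "_"), safe=":/")


def fetch_html(title, base_url=BASE_URL, timeout=30):
    """Download the rendered page and return it as text (needs network access)."""
    request = Request(
        page_url(title, base_url),
        # Placeholder User-Agent; replace it with one describing your tool.
        headers={"User-Agent": "html-fetch-example/0.1"},
    )
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

With a Pywikibot page object you would call fetch_html(page.title()). This returns the full page including skin, navigation and footer, not just the article body.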

Answer 2

"[saveHTML.py] downloads the HTML pages of articles and images and saves the interesting parts, i.e. the article text and the footer, to a file"

Source: https://git.wikimedia.org/blob/pywikibot%2Fcompat.git/HEAD/saveHTML.py

Answer 3

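IIRC you want the HTML of entire pages, so you need something that uses api.php?action=parse. In Python I'd often just use wikitools for this kind of thing; I don't know about PWB or what other requirements you have.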
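For reference, a call to api.php?action=parse can be sketched with nothing but the standard library. The endpoint URL, the function names and the User-Agent below are assumptions for illustration; the sketch also assumes the default JSON format, in which the rendered HTML sits under parse, text, "*":

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumed endpoint; use the api.php URL of the wiki you are working with.
API_URL = "https://en.wikipedia.org/w/api.php"


def build_parse_url(title, api_url=API_URL):
    """Build an action=parse request URL for one page title."""
    params = {"action": "parse", "page": title, "prop": "text", "format": "json"}
    return api_url + "?" + urlencode(params)


def extract_html(payload):
    """Pull the HTML fragment out of a decoded action=parse response."""
    if "parse" not in payload:
        raise ValueError("no 'parse' key in API response: %r" % (payload.get("error"),))
    return payload["parse"]["text"]["*"]


def fetch_parsed_html(title, api_url=API_URL, timeout=30):
    """Query the API and return the parser output (needs network access)."""
    request = Request(
        build_parse_url(title, api_url),
        # Placeholder User-Agent; replace it with one describing your tool.
        headers={"User-Agent": "parse-example/0.1"},
    )
    with urlopen(request, timeout=timeout) as response:
        return extract_html(json.load(response))
```

Unlike fetching the page from /wiki/, this returns only the parser output for the page content, without the skin around it.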

Answer 4

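In general you should use pywikibot instead of wikipedia (e.g. write "import pywikibot" instead of "import wikipedia"). If you are looking for methods and classes that used to be in wikipedia.py, they are now split up and can be found in the pywikibot folder (mainly in page.py and site.py).

If you want to run scripts that you wrote for compat, pywikibot-core has a script named compat2core.py (in the scripts folder), and there is detailed help on the conversion in README-conversion.txt; read it carefully.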

Answer 5

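The Mediawiki API has a parse action that returns the HTML snippet for a piece of wiki markup, as produced by the Mediawiki markup parser.

For the pywikibot library there is a function already implemented, which you can use like this: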

def getHtml(self,pageTitle):
        '''
        get the HTML code for the given page Title
        
        Args:
            pageTitle(str): the title of the page to retrieve
            
        Returns:
            str: the rendered HTML code for the page
        '''
        page=self.getPage(pageTitle)
        html=page._get_parsed_page()
        return html

When using the mwclient Python library there is a generic api method, see: https://github.com/mwclient/mwclient/blob/master/mwclient/client.py

It can be used to retrieve the HTML code like this:

def getHtml(self,pageTitle):
        '''
        get the HTML code for the given page Title
        
        Args:
            pageTitle(str): the title of the page to retrieve
        '''
        api=self.getSite().api("parse",page=pageTitle)
        if "parse" not in api:
            raise Exception("could not retrieve html for page %s" % pageTitle)
        html=api["parse"]["text"]["*"]
        return html

As shown above, this gives a duck-typed interface, which is implemented in the py-3rdparty-mediawiki library, for which I am a committer. This was resolved by closing issue 38 - add html page retrieval.

Answer 6

With Pywikibot you may use http.request() to get the HTML content:

import pywikibot
from pywikibot.comms import http
site = pywikibot.Site('wikipedia:en')
page = pywikibot.Page(site, 'Elvis Presley')
path = '{}/index.php?title={}'.format(site.scriptpath(), page.title(as_url=True))
r = http.request(site, path)
print(r[94:135])

This should give the HTML content:

'<title>Elvis Presley – Wikipedia</title>\n'

With Pywikibot 6.0, http.request() returns a requests.Response object rather than plain text. In that case you must use the text attribute:

print(r.text[94:135])

to get the same result.
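If your script has to work with both return types, or you don't want to rely on the fixed offsets 94:135 (which move whenever the page head changes), something like the following might help. Both helpers are my own hypothetical additions, not part of Pywikibot:

```python
import re


def as_text(result):
    """Return the HTML as str, whether result is plain text or a Response-like object."""
    # requests.Response exposes the decoded body as .text; a plain str has no such attribute.
    text = getattr(result, "text", None)
    return text if isinstance(text, str) else result


def extract_title(html):
    """Return the content of the first <title> element, or None if there is none."""
    match = re.search(r"<title>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None
```

You would then write print(extract_title(as_text(r))) instead of slicing r or r.text directly.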

How do I get the HTML of a wiki page with Pywikibot?

6 answers: