Wikipedia infobox content

Time: 2011-11-11 00:22:46

Tags: python mediawiki wikipedia pywikibot

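I need to get the infobox content of any movie. I know the name of the movie. One way is to fetch the complete content of the Wikipedia page and then parse it until I find {{Infobox, and then extract the content of the infobox.

Is there another way to do this, using some API or parser?

I am using Python and the pywikipediabot API.

I am also familiar with the wikitools API, so if someone has a solution based on wikitools instead of pywikipedia, please mention that as well.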
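For reference, the manual approach described above (find {{Infobox, then read up to its matching closing braces) can be sketched with the standard library alone. This is only an illustration: extract_infobox is a hypothetical helper name, the marker is matched case-sensitively, and the scan does not account for braces inside comments or nowiki sections.

```python
def extract_infobox(wikitext, marker="{{Infobox"):
    """Return the first infobox template (braces included), or None."""
    start = wikitext.find(marker)
    if start == -1:
        return None
    depth = 0
    i = start
    # Walk the text two characters at a time, tracking template nesting depth.
    while i < len(wikitext) - 1:
        pair = wikitext[i:i + 2]
        if pair == "{{":
            depth += 1
            i += 2
        elif pair == "}}":
            depth -= 1
            i += 2
            if depth == 0:
                return wikitext[start:i]
        else:
            i += 1
    return None  # braces never balanced


sample = (
    "{{Use mdy dates|date=September 2012}}\n"
    "{{Infobox film\n"
    "| name     = Waking Life\n"
    "| released = {{Film date|2001|01|23}}\n"
    "}}\n"
    "'''Waking Life''' is a 2001 film."
)
print(extract_infobox(sample))
```

Counting brace depth is what lets the nested {{Film date}} template stay inside the extracted block instead of ending it early.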

4 answers:

Answer 0 (score: 10)

Another excellent MediaWiki parser is mwparserfromhell:

In [1]: import mwparserfromhell

In [2]: import pywikibot

In [3]: enwp = pywikibot.Site('en','wikipedia')

In [4]: page = pywikibot.Page(enwp, 'Waking Life')            

In [5]: wikitext = page.get()               

In [6]: wikicode = mwparserfromhell.parse(wikitext)

In [7]: templates = wikicode.filter_templates()

In [8]: templates?
Type:       list
String Form:[u'{{Use mdy dates|date=September 2012}}', u"{{Infobox film\n| name           = Waking Life\n| im <...> critic film|waking-life|Waking Life}}', u'{{Richard Linklater}}', u'{{DEFAULTSORT:Waking Life}}']
Length:     31
Docstring:
list() -> new empty list
list(iterable) -> new list initialized from iterable's items

In [10]: templates[:2]
Out[10]: 
[u'{{Use mdy dates|date=September 2012}}',
 u"{{Infobox film\n| name           = Waking Life\n| image          = Waking-Life-Poster.jpg\n| image_size     = 220px\n| alt            =\n| caption        = Theatrical release poster\n| director       = [[Richard Linklater]]\n| producer       = [[Tommy Pallotta]]<br />[[Jonah Smith]]<br />Anne Walker-McBay<br />Palmer West\n| writer         = Richard Linklater\n| starring       = [[Wiley Wiggins]]\n| music          = Glover Gill\n| cinematography = Richard Linklater<br />[[Tommy Pallotta]]\n| editing        = Sandra Adair\n| studio         = [[Thousand Words]]\n| distributor    = [[Fox Searchlight Pictures]]\n| released       = {{Film date|2001|01|23|[[Sundance Film Festival|Sundance]]|2001|10|19|United States}}\n| runtime        = 101 minutes<!--Theatrical runtime: 100:40--><ref>{{cite web |title=''WAKING LIFE'' (15) |url=http://www.bbfc.co.uk/releases/waking-life-2002-3|work=[[British Board of Film Classification]]|date=September 19, 2001|accessdate=May 6, 2013}}</ref>\n| country        = United States\n| language       = English\n| budget         =\n| gross          = $3,176,880<ref>{{cite web|title=''Waking Life'' (2001)|work=[[Box Office Mojo]] |url=http://www.boxofficemojo.com/movies/?id=wakinglife.htm|accessdate=March 20, 2010}}</ref>\n}}"]

In [11]: infobox_film = templates[1]

In [12]: for param in infobox_film.params:
             print param.name, param.value

 name             Waking Life

 image            Waking-Life-Poster.jpg

 image_size       220px

 alt             

 caption          Theatrical release poster

 director         [[Richard Linklater]]

 producer         [[Tommy Pallotta]]<br />[[Jonah Smith]]<br />Anne Walker-McBay<br />Palmer West

 writer           Richard Linklater

 starring         [[Wiley Wiggins]]

 music            Glover Gill

 cinematography   Richard Linklater<br />[[Tommy Pallotta]]

 editing          Sandra Adair

 studio           [[Thousand Words]]

 distributor      [[Fox Searchlight Pictures]]

 released         {{Film date|2001|01|23|[[Sundance Film Festival|Sundance]]|2001|10|19|United States}}

 runtime          101 minutes<!--Theatrical runtime: 100:40--><ref>{{cite web |title=''WAKING LIFE'' (15) |url=http://www.bbfc.co.uk/releases/waking-life-2002-3|work=[[British Board of Film Classification]]|date=September 19, 2001|accessdate=May 6, 2013}}</ref>

 country          United States

 language         English

 budget          

 gross            $3,176,880<ref>{{cite web|title=''Waking Life'' (2001)|work=[[Box Office Mojo]] |url=http://www.boxofficemojo.com/movies/?id=wakinglife.htm|accessdate=March 20, 2010}}</ref>

Don't forget that the params are mwparserfromhell objects too!
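Since each param.name and param.value is a parser object and not a plain string, it can be convenient to convert them before further processing. The following is a minimal sketch under assumptions: infobox_params is a hypothetical helper (not part of mwparserfromhell), and the import is guarded only so that the helper can be read and tried without the library installed.

```python
try:
    import mwparserfromhell  # third-party: pip install mwparserfromhell
except ImportError:
    mwparserfromhell = None  # the helper below only needs .params/.name/.value


def infobox_params(template):
    """Map parameter names to values as stripped plain strings.

    Hypothetical helper: works on any object exposing .params whose items
    have .name and .value, such as a mwparserfromhell template.
    """
    return {str(p.name).strip(): str(p.value).strip() for p in template.params}


if mwparserfromhell is not None:
    wikitext = (
        "{{Infobox film\n"
        "| name     = Waking Life\n"
        "| director = [[Richard Linklater]]\n"
        "}}"
    )
    infobox = mwparserfromhell.parse(wikitext).filter_templates()[0]
    print(infobox_params(infobox))
```

The values are still raw wikitext after this step, so links such as [[Richard Linklater]] remain in place.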

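Answer 1 (score: 6)

Don't reinvent the wheel: have a look at DBpedia, which has already extracted all Wikipedia infoboxes into an easily queryable database format.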
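As a rough sketch of what querying DBpedia could look like, the snippet below only builds a request URL for DBpedia's public SPARQL endpoint and does not send it. The helper name dbpedia_query_url is hypothetical, and the code assumes that the resource IRI mirrors the English Wikipedia title with spaces replaced by underscores, which may not hold for every article.

```python
from urllib.parse import quote, urlencode

DBPEDIA_SPARQL = "https://dbpedia.org/sparql"  # public endpoint


def dbpedia_query_url(article_title):
    """Build a URL that asks DBpedia for every property of one article."""
    # Assumption: the DBpedia resource name is the article title with underscores.
    resource = "http://dbpedia.org/resource/" + quote(article_title.replace(" ", "_"))
    query = f"SELECT ?property ?value WHERE {{ <{resource}> ?property ?value }}"
    params = {"query": query, "format": "application/sparql-results+json"}
    return DBPEDIA_SPARQL + "?" + urlencode(params)


print(dbpedia_query_url("Waking Life"))
```

Fetching the resulting URL with any HTTP client should return JSON bindings that can then be filtered for the properties of interest.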

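Answer 2 (score: 0)

You can get the wiki page content with pywikipediabot and then search for the infobox with a regular expression, use a parser such as mwlib [0], or stay with pywikipediabot and use one of its template tools. For example, in textlib you will find some functions for dealing with templates (hint: search for "# Functions dealing with templates"). [1]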

[0] - http://pypi.python.org/pypi/mwlib

[1] - http://svn.wikimedia.org/viewvc/pywikipedia/trunk/pywikipedia/pywikibot/textlib.py?view=markup
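To illustrate the regular-expression option, here is a small sketch that only detects which infobox templates a page uses. The pattern and the helper name infobox_names are illustrative assumptions; a regex like this can locate where an infobox starts, but it cannot reliably find the matching closing braces once templates are nested, which is where a parser helps.

```python
import re

# Matches "{{Infobox <type>" up to the first pipe, closing brace or newline.
INFOBOX_NAME = re.compile(r"\{\{\s*(Infobox[^|}\n]*)", re.IGNORECASE)


def infobox_names(wikitext):
    """Return the names of all infobox templates that appear in the text."""
    return [match.group(1).strip() for match in INFOBOX_NAME.finditer(wikitext)]


sample = (
    "{{Use mdy dates|date=September 2012}}\n"
    "{{Infobox film\n"
    "| name = Waking Life\n"
    "}}\n"
)
print(infobox_names(sample))
```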

Answer 3 (score: 0)

Any infobox is a template enclosed in double curly braces. Let's look at one template and how it is embedded in the wikitext:

Infobox film

{{Infobox film
| name           = Actresses
| image          = Actrius film poster.jpg
| alt            = 
| caption        = Catalan language film poster
| native_name      = ([[Catalan language|Catalan]]: '''''Actrius''''')
| director       = [[Ventura Pons]]
| producer       = Ventura Pons
| writer         = [[Josep Maria Benet i Jornet]]
| screenplay     = Ventura Pons
| story          = 
| based_on       = {{based on|(stage play) ''E.R.''|Josep Maria Benet i Jornet}}
| starring       = {{ubl|[[Núria Espert]]|[[Rosa Maria Sardà]]|[[Anna Lizaran]]|[[Mercè Pons]]}}
| narrator       = <!-- or: |narrators = -->
| music          = Carles Cases
| cinematography = Tomàs Pladevall
| editing        = Pere Abadal
| production_companies = {{ubl|[[Canal+|Canal+ España]]|Els Films de la Rambla S.A.|[[Generalitat de Catalunya|Generalitat de Catalunya - Departament de Cultura]]|[[Televisión Española]]}}
| distributor    = [[Buena Vista International]]
| released       = {{film date|df=yes|1997|1|17|[[Spain]]}}
| runtime        = 100 minutes
| country        = Spain
| language       = Catalan
| budget         = 
| gross          = <!--(please use condensed and rounded values, e.g. "£11.6 million" not "£11,586,221")-->
}}

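Pywikibot has two high-level Page methods for parsing the content of any template within the wikitext. Both use mwparserfromhell if it is installed; otherwise they fall back to a regular expression, which may fail for nested templates with depth > 3: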

raw_extracted_templates

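raw_extracted_templates is a Page property that returns a list of tuples with two items each. The first item is the template identifier as a str, for example 'Infobox film'. The second item is an OrderedDict with the template parameter identifiers as keys and their assigned values as values. For example, the template fields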

| name = FILM TITLE
| image = FILM TITLE poster.jpg
| caption = Theatrical release poster

result in the OrderedDict

OrderedDict([('name', 'FILM TITLE'), ('image', 'FILM TITLE poster.jpg'), ('caption', 'Theatrical release poster')])

Now, how do you get it with Pywikibot?

from pprint import pprint
import pywikibot

site = pywikibot.Site('wikipedia:en')  # or pywikibot.Site('en', 'wikipedia') for older releases
page = pywikibot.Page(site, 'Actrius')
# raw_extracted_templates is a property of the Page object itself
all_templates = page.raw_extracted_templates
for tmpl, params in all_templates:
    if tmpl == 'Infobox film':
        pprint(params)

This prints

 OrderedDict([('name', 'Actresses'),
              ('image', 'Actrius film poster.jpg'),
              ('alt', ''),
              ('caption', 'Catalan language film poster'),
              ('native_name',
               "([[Catalan language|Catalan]]: '''''Actrius''''')"),
              ('director', '[[Ventura Pons]]'),
              ('producer', 'Ventura Pons'),
              ('writer', '[[Josep Maria Benet i Jornet]]'),
              ('screenplay', 'Ventura Pons'),
              ('story', ''),
              ('based_on',
               "{{based on|(stage play) ''E.R.''|Josep Maria Benet i Jornet}}"),
              ('starring',
               '{{ubl|[[Núria Espert]]|[[Rosa Maria Sardà]]|[[Anna '
               'Lizaran]]|[[Mercè Pons]]}}'),
              ('narrator', ''),
              ('music', 'Carles Cases'),
              ('cinematography', 'Tomàs Pladevall'),
              ('editing', 'Pere Abadal'),
              ('production_companies',
               '{{ubl|[[Canal+|Canal+ España]]|Els Films de la Rambla '
               'S.A.|[[Generalitat de Catalunya|Generalitat de Catalunya - '
               'Departament de Cultura]]|[[Televisión Española]]}}'),
              ('distributor', '[[Buena Vista International]]'),
              ('released', '{{film date|df=yes|1997|1|17|[[Spain]]}}'),
              ('runtime', '100 minutes'),
              ('country', 'Spain'),
              ('language', 'Catalan'),
              ('budget', ''),
              ('gross', '')])
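The values in this OrderedDict are still raw wikitext. If plain text is wanted, one possible post-processing step is to replace wiki links with their display text. The sketch below uses only the standard library; strip_wikilinks and the regular expression are illustrative assumptions, not part of Pywikibot, and nested templates such as {{ubl|...}} are left untouched.

```python
import re
from collections import OrderedDict

# [[Target]] -> Target, [[Target|Label]] -> Label
WIKILINK = re.compile(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]")


def strip_wikilinks(value):
    """Replace every wiki link in value with its visible text."""
    return WIKILINK.sub(r"\1", value)


params = OrderedDict([
    ('director', '[[Ventura Pons]]'),
    ('native_name', "([[Catalan language|Catalan]]: '''''Actrius''''')"),
    ('runtime', '100 minutes'),
])
plain = OrderedDict((key, strip_wikilinks(value)) for key, value in params.items())
for key, value in plain.items():
    print(key, '=', value)
```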

templatesWithParams()

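This is similar to the raw_extracted_templates property, but the method returns a list of tuples with two items each. The first item is the template as a Page object. The second item is a list of the template parameters. Have a look at the example:

Sample code

from pprint import pprint
import pywikibot

site = pywikibot.Site('wikipedia:en')  # or pywikibot.Site('en', 'wikipedia') for older releases
page = pywikibot.Page(site, 'Actrius')
all_templates = page.templatesWithParams()
for tmpl, params in all_templates:
    # tmpl is a Page object, so compare its title without the "Template:" namespace
    if tmpl.title(with_ns=False) == 'Infobox film':
        pprint(params)

This prints the list: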

['alt=',
 "based_on={{based on|(stage play) ''E.R.''|Josep Maria Benet i Jornet}}",
 'budget=',
 'caption=Catalan language film poster',
 'cinematography=Tomàs Pladevall',
 'country=Spain',
 'director=[[Ventura Pons]]',
 'distributor=[[Buena Vista International]]',
 'editing=Pere Abadal',
 'gross=',
 'image=Actrius film poster.jpg',
 'language=Catalan',
 'music=Carles Cases',
 'name=Actresses',
 'narrator=',
 "native_name=([[Catalan language|Catalan]]: '''''Actrius''''')",
 'producer=Ventura Pons',
 'production_companies={{ubl|[[Canal+|Canal+ España]]|Els Films de la Rambla '
 'S.A.|[[Generalitat de Catalunya|Generalitat de Catalunya - Departament de '
 'Cultura]]|[[Televisión Española]]}}',
 'released={{film date|df=yes|1997|1|17|[[Spain]]}}',
 'runtime=100 minutes',
 'screenplay=Ventura Pons',
 'starring={{ubl|[[Núria Espert]]|[[Rosa Maria Sardà]]|[[Anna '
 'Lizaran]]|[[Mercè Pons]]}}',
 'story=',
 'writer=[[Josep Maria Benet i Jornet]]']
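Because templatesWithParams() yields the parameters as 'key=value' strings, a small conversion step gives a dictionary comparable to the one from raw_extracted_templates. This is a minimal sketch: params_to_dict is a hypothetical helper, and numbering unnamed parameters by their order is an assumption made for illustration.

```python
def params_to_dict(params):
    """Turn a list of 'key=value' strings into a dict.

    Splits on the first '=' only, so values that contain '=' themselves
    (for example nested templates like {{film date|df=yes|...}}) stay intact.
    """
    result = {}
    positional = 0
    for item in params:
        key, sep, value = item.partition('=')
        if sep:
            result[key.strip()] = value.strip()
        else:
            # Assumption: unnamed parameters are keyed '1', '2', ... in order.
            positional += 1
            result[str(positional)] = item.strip()
    return result


example = [
    'alt=',
    'released={{film date|df=yes|1997|1|17|[[Spain]]}}',
    'runtime=100 minutes',
]
print(params_to_dict(example))
```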