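I need to fetch the infobox content for any movie, and I know the movie's name. One way is to get the complete content of the Wikipedia page, parse it until I find {{Infobox, and then extract the infobox content.
Is there another way to do this using some API or parser?
I am using Python and the pywikipediabot API.
I am also familiar with the wikitools API, so if anyone has a solution based on wikitools instead of pywikipedia, please mention it as well.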
Answer 0 (score: 10)
Another great MediaWiki parser is mwparserfromhell.
In [1]: import mwparserfromhell
In [2]: import pywikibot
In [3]: enwp = pywikibot.Site('en','wikipedia')
In [4]: page = pywikibot.Page(enwp, 'Waking Life')
In [5]: wikitext = page.get()
In [6]: wikicode = mwparserfromhell.parse(wikitext)
In [7]: templates = wikicode.filter_templates()
In [8]: templates?
Type: list
String Form:[u'{{Use mdy dates|date=September 2012}}', u"{{Infobox film\n| name = Waking Life\n| im <...> critic film|waking-life|Waking Life}}', u'{{Richard Linklater}}', u'{{DEFAULTSORT:Waking Life}}']
Length: 31
Docstring:
list() -> new empty list
list(iterable) -> new list initialized from iterable's items
In [10]: templates[:2]
Out[10]:
[u'{{Use mdy dates|date=September 2012}}',
u"{{Infobox film\n| name = Waking Life\n| image = Waking-Life-Poster.jpg\n| image_size = 220px\n| alt =\n| caption = Theatrical release poster\n| director = [[Richard Linklater]]\n| producer = [[Tommy Pallotta]]<br />[[Jonah Smith]]<br />Anne Walker-McBay<br />Palmer West\n| writer = Richard Linklater\n| starring = [[Wiley Wiggins]]\n| music = Glover Gill\n| cinematography = Richard Linklater<br />[[Tommy Pallotta]]\n| editing = Sandra Adair\n| studio = [[Thousand Words]]\n| distributor = [[Fox Searchlight Pictures]]\n| released = {{Film date|2001|01|23|[[Sundance Film Festival|Sundance]]|2001|10|19|United States}}\n| runtime = 101 minutes<!--Theatrical runtime: 100:40--><ref>{{cite web |title=''WAKING LIFE'' (15) |url=http://www.bbfc.co.uk/releases/waking-life-2002-3|work=[[British Board of Film Classification]]|date=September 19, 2001|accessdate=May 6, 2013}}</ref>\n| country = United States\n| language = English\n| budget =\n| gross = $3,176,880<ref>{{cite web|title=''Waking Life'' (2001)|work=[[Box Office Mojo]] |url=http://www.boxofficemojo.com/movies/?id=wakinglife.htm|accessdate=March 20, 2010}}</ref>\n}}"]
In [11]: infobox_film = templates[1]
In [12]: for param in infobox_film.params:
    print param.name, param.value
name Waking Life
image Waking-Life-Poster.jpg
image_size 220px
alt
caption Theatrical release poster
director [[Richard Linklater]]
producer [[Tommy Pallotta]]<br />[[Jonah Smith]]<br />Anne Walker-McBay<br />Palmer West
writer Richard Linklater
starring [[Wiley Wiggins]]
music Glover Gill
cinematography Richard Linklater<br />[[Tommy Pallotta]]
editing Sandra Adair
studio [[Thousand Words]]
distributor [[Fox Searchlight Pictures]]
released {{Film date|2001|01|23|[[Sundance Film Festival|Sundance]]|2001|10|19|United States}}
runtime 101 minutes<!--Theatrical runtime: 100:40--><ref>{{cite web |title=''WAKING LIFE'' (15) |url=http://www.bbfc.co.uk/releases/waking-life-2002-3|work=[[British Board of Film Classification]]|date=September 19, 2001|accessdate=May 6, 2013}}</ref>
country United States
language English
budget
gross $3,176,880<ref>{{cite web|title=''Waking Life'' (2001)|work=[[Box Office Mojo]] |url=http://www.boxofficemojo.com/movies/?id=wakinglife.htm|accessdate=March 20, 2010}}</ref>
Don't forget that the params are mwparserfromhell objects too!
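Answer 1 (score: 6)
Don't reinvent the wheel. Take a look at DBpedia, which has already extracted all Wikipedia infoboxes into an easily queryable database format.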
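As a rough sketch of what querying DBpedia could look like from Python, the snippet below builds a SPARQL request URL with the standard library only. The assumptions here are mine, not the answer's: the public endpoint at https://dbpedia.org/sparql, a resource URI formed by replacing spaces in the title with underscores, and the helper names build_query_url and fetch_properties. Titles with other special characters would need more careful handling.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the public DBpedia SPARQL endpoint.
ENDPOINT = "https://dbpedia.org/sparql"


def build_query_url(title):
    """Build a URL that asks DBpedia for every property of one resource."""
    # Assumption: the resource name is the page title with underscores.
    resource = "http://dbpedia.org/resource/" + title.replace(" ", "_")
    query = "SELECT ?property ?value WHERE { <%s> ?property ?value }" % resource
    params = {"query": query, "format": "application/sparql-results+json"}
    return ENDPOINT + "?" + urllib.parse.urlencode(params)


def fetch_properties(title):
    """Run the query (needs network access) and return (property, value) pairs."""
    with urllib.request.urlopen(build_query_url(title)) as response:
        data = json.load(response)
    return [
        (row["property"]["value"], row["value"]["value"])
        for row in data["results"]["bindings"]
    ]


url = build_query_url("Waking Life")
print(url)
```

Only build_query_url is executed here; calling fetch_properties("Waking Life") would perform the actual request.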
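Answer 2 (score: 0)
You can fetch the wiki page content with pywikipediabot and then search for the infobox with a regular expression, use a parser such as mwlib [0], or stay with pywikipediabot and use one of its template tools. For example, in textlib you will find some functions that deal with templates (hint: search for "#functions with templates"). [1]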
[0] - http://pypi.python.org/pypi/mwlib
[1] - http://svn.wikimedia.org/viewvc/pywikipedia/trunk/pywikipedia/pywikibot/textlib.py?view=markup
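If you go the manual route, a plain regular expression tends to break on nested templates, so counting braces is usually safer. The following is only a minimal standard-library sketch under that assumption; extract_template is a made-up helper name, the sample text is invented, and triple-brace parameters such as {{{1}}} are not handled.

```python
def extract_template(wikitext, name="Infobox"):
    """Return the first template whose name starts with `name`, or None."""
    start = wikitext.find("{{" + name)
    if start == -1:
        return None
    depth = 0
    i = start
    while i < len(wikitext) - 1:
        pair = wikitext[i:i + 2]
        if pair == "{{":      # a template opens (possibly a nested one)
            depth += 1
            i += 2
        elif pair == "}}":    # a template closes
            depth -= 1
            i += 2
            if depth == 0:    # the outermost template is complete
                return wikitext[start:i]
        else:
            i += 1
    return None  # braces were never balanced


# Hypothetical page text with a nested template inside the infobox.
SAMPLE = (
    "Intro text.\n"
    "{{Infobox film\n"
    "| name = Actresses\n"
    "| released = {{film date|df=yes|1997|1|17|[[Spain]]}}\n"
    "}}\n"
    "More text {{other}}."
)

infobox = extract_template(SAMPLE)
print(infobox)
```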
Answer 3 (score: 0)
Any infobox is a template enclosed in double curly braces. Let's look at one template and how it is embedded in the wikitext:
{{Infobox film
| name = Actresses
| image = Actrius film poster.jpg
| alt =
| caption = Catalan language film poster
| native_name = ([[Catalan language|Catalan]]: '''''Actrius''''')
| director = [[Ventura Pons]]
| producer = Ventura Pons
| writer = [[Josep Maria Benet i Jornet]]
| screenplay = Ventura Pons
| story =
| based_on = {{based on|(stage play) ''E.R.''|Josep Maria Benet i Jornet}}
| starring = {{ubl|[[Núria Espert]]|[[Rosa Maria Sardà]]|[[Anna Lizaran]]|[[Mercè Pons]]}}
| narrator = <!-- or: |narrators = -->
| music = Carles Cases
| cinematography = Tomàs Pladevall
| editing = Pere Abadal
| production_companies = {{ubl|[[Canal+|Canal+ España]]|Els Films de la Rambla S.A.|[[Generalitat de Catalunya|Generalitat de Catalunya - Departament de Cultura]]|[[Televisión Española]]}}
| distributor = [[Buena Vista International]]
| released = {{film date|df=yes|1997|1|17|[[Spain]]}}
| runtime = 100 minutes
| country = Spain
| language = Catalan
| budget =
| gross = <!--(please use condensed and rounded values, e.g. "£11.6 million" not "£11,586,221")-->
}}
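Pywikibot has two high-level Page methods for parsing the content of any template within the wikitext. Both use mwparserfromhell if it is installed; otherwise they fall back to a regular expression, which may fail for templates nested more than three levels deep:
raw_extracted_templates is a Page property that returns a list of tuples with two items each. The first item is the template identifier as a str, for example 'Infobox film'. The second item is an OrderedDict with the template parameter identifiers as keys and their assigned values as values. For example, the template fields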
| name = FILM TITLE
| image = FILM TITLE poster.jpg
| caption = Theatrical release poster
result in the OrderedDict
OrderedDict([('name', 'FILM TITLE'), ('image', 'FILM TITLE poster.jpg'), ('caption', 'Theatrical release poster')])
Now, how do you get it with Pywikibot?
from pprint import pprint
import pywikibot
site = pywikibot.Site('wikipedia:en')  # or pywikibot.Site('en', 'wikipedia') for older releases
page = pywikibot.Page(site, 'Actrius')
all_templates = page.raw_extracted_templates
for tmpl, params in all_templates:
    if tmpl == 'Infobox film':
        pprint(params)
This will print
OrderedDict([('name', 'Actresses'),
('image', 'Actrius film poster.jpg'),
('alt', ''),
('caption', 'Catalan language film poster'),
('native_name',
"([[Catalan language|Catalan]]: '''''Actrius''''')"),
('director', '[[Ventura Pons]]'),
('producer', 'Ventura Pons'),
('writer', '[[Josep Maria Benet i Jornet]]'),
('screenplay', 'Ventura Pons'),
('story', ''),
('based_on',
"{{based on|(stage play) ''E.R.''|Josep Maria Benet i Jornet}}"),
('starring',
'{{ubl|[[Núria Espert]]|[[Rosa Maria Sardà]]|[[Anna '
'Lizaran]]|[[Mercè Pons]]}}'),
('narrator', ''),
('music', 'Carles Cases'),
('cinematography', 'Tomàs Pladevall'),
('editing', 'Pere Abadal'),
('production_companies',
'{{ubl|[[Canal+|Canal+ España]]|Els Films de la Rambla '
'S.A.|[[Generalitat de Catalunya|Generalitat de Catalunya - '
'Departament de Cultura]]|[[Televisión Española]]}}'),
('distributor', '[[Buena Vista International]]'),
('released', '{{film date|df=yes|1997|1|17|[[Spain]]}}'),
('runtime', '100 minutes'),
('country', 'Spain'),
('language', 'Catalan'),
('budget', ''),
('gross', '')])
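The values in that OrderedDict are still raw wikitext. If plain text is wanted, one possible post-processing step is to unwrap the wikilinks with a regular expression. This is only a sketch: plain_values is a made-up helper name, the sample dict copies a few entries from the output above, and nested templates such as {{ubl|...}} are left untouched.

```python
import re
from collections import OrderedDict

# [[target|label]] -> label, [[target]] -> target
WIKILINK = re.compile(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]")


def plain_values(params):
    """Return a copy of params with wikilinks replaced by their visible text."""
    return OrderedDict(
        (key, WIKILINK.sub(r"\1", value).strip())
        for key, value in params.items()
    )


# A few entries taken from the printed result above.
sample = OrderedDict([
    ('name', 'Actresses'),
    ('native_name', "([[Catalan language|Catalan]]: '''''Actrius''''')"),
    ('director', '[[Ventura Pons]]'),
    ('budget', ''),
])

cleaned = plain_values(sample)
for key, value in cleaned.items():
    print(key, '->', value)
```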
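templatesWithParams() is similar to the raw_extracted_templates property, but this method returns a list of tuples with two items each. The first item is the template as a Page object. The second item is a list of the template parameters. Take a look at the sample:
Sample code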
from pprint import pprint
import pywikibot
site = pywikibot.Site('wikipedia:en')  # or pywikibot.Site('en', 'wikipedia') for older releases
page = pywikibot.Page(site, 'Actrius')
all_templates = page.templatesWithParams()
for tmpl, params in all_templates:
    if tmpl.title(with_ns=False) == 'Infobox film':
        pprint(params)
This will print the list:
['alt=',
"based_on={{based on|(stage play) ''E.R.''|Josep Maria Benet i Jornet}}",
'budget=',
'caption=Catalan language film poster',
'cinematography=Tomàs Pladevall',
'country=Spain',
'director=[[Ventura Pons]]',
'distributor=[[Buena Vista International]]',
'editing=Pere Abadal',
'gross=',
'image=Actrius film poster.jpg',
'language=Catalan',
'music=Carles Cases',
'name=Actresses',
'narrator=',
"native_name=([[Catalan language|Catalan]]: '''''Actrius''''')",
'producer=Ventura Pons',
'production_companies={{ubl|[[Canal+|Canal+ España]]|Els Films de la Rambla '
'S.A.|[[Generalitat de Catalunya|Generalitat de Catalunya - Departament de '
'Cultura]]|[[Televisión Española]]}}',
'released={{film date|df=yes|1997|1|17|[[Spain]]}}',
'runtime=100 minutes',
'screenplay=Ventura Pons',
'starring={{ubl|[[Núria Espert]]|[[Rosa Maria Sardà]]|[[Anna '
'Lizaran]]|[[Mercè Pons]]}}',
'story=',
'writer=[[Josep Maria Benet i Jornet]]']
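Since each entry in that list is a single 'key=value' string, you may want to turn it into a dict before looking up fields. Below is a small sketch of one way to do that; params_to_dict is a made-up helper name, and numbering entries that contain no '=' as positional parameters is my assumption about how such entries should be treated. Splitting only at the first '=' keeps values like the film date template intact.

```python
def params_to_dict(params):
    """Convert ['key=value', ...] into a dict, splitting at the first '='."""
    result = {}
    positional = 0
    for item in params:
        key, sep, value = item.partition("=")
        if sep:
            result[key.strip()] = value.strip()
        else:
            # Assumption: an entry without '=' is a positional parameter.
            positional += 1
            result[str(positional)] = item.strip()
    return result


# A few entries taken from the printed list above.
sample = [
    'budget=',
    'country=Spain',
    'director=[[Ventura Pons]]',
    'released={{film date|df=yes|1997|1|17|[[Spain]]}}',
]

info = params_to_dict(sample)
print(info['released'])
```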