Can I use pywikipedia to get the text of a page?

Time: 2009-06-20 15:49:28

Tags: python wiki mediawiki pywikibot

Is it possible to use pywikipedia to get just the text of a page, without any internal links, templates, images, and so on?

4 answers:

Answer 0 (score: 4)

If you mean "I just want the wikitext", have a look at the wikipedia.Page class and its get method.

import wikipedia

site = wikipedia.getSite('en', 'wikipedia')
page = wikipedia.Page(site, 'Test')

print page.get() # '''Test''', '''TEST''' or '''Tester''' may refer to:
#==Science and technology==
#* [[Concept inventory]] - an assessment to reveal student thinking on a topic.
# ...

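This gives you the complete raw wikitext of the article.

If you also want to strip the wiki syntax, for example turning [[Concept inventory]] into Concept inventory, it gets a lot more painful.

The main reason for the trouble is that MediaWiki wikitext has no formally defined grammar, which makes it very hard to parse and to strip. I currently know of no software that does this accurately. There is of course the MediaWiki Parser class, but it is PHP, somewhat hard to grasp, and built for a very different purpose.

But if you only want to remove links or other very simple wiki constructs, use regular expressions: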

import re

text = re.sub(r'\[\[([^\]\|]*)\]\]', r'\1', 'Lorem ipsum [[dolor]] sit amet, consectetur adipiscing elit.')
print(text) # Lorem ipsum dolor sit amet, consectetur adipiscing elit.

Then for piped links:

text = re.sub(r'\[\[(?:[^\]\|]*)\|([^\]\|]*)\]\]', r'\1', 'Lorem ipsum [[dolor|DOLOR]] sit amet, consectetur adipiscing elit.')
print(text) # Lorem ipsum DOLOR sit amet, consectetur adipiscing elit.

And so on.
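The two substitutions can also be folded into a single pass. The following is only a sketch: the names WIKILINK and strip_links are made up for illustration, and it handles nothing beyond simple [[target]] and [[target|label]] links.

```python
import re

# [[target]] or [[target|label]]; the label part is optional.
# Brackets and pipes are excluded from both groups, so nested or
# multi-pipe constructs (such as image links) do not match at all.
WIKILINK = re.compile(r"\[\[([^\[\]|]*)(?:\|([^\[\]|]*))?\]\]")

def strip_links(text):
    """Replace each simple wikilink with its visible text."""
    def visible(match):
        target, label = match.group(1), match.group(2)
        # A piped link shows its label; a plain link shows its target.
        return label if label is not None else target
    return WIKILINK.sub(visible, text)

print(strip_links("Lorem ipsum [[dolor]] sit [[amet|AMET]]."))
```

Links with more than one pipe, such as [[Image:x.jpg|thumb|right|caption]], are left untouched by this pattern, so they still need separate handling.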

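However, there is no reliable simple way to strip, for example, nested templates from a page. The same goes for images that have links in their captions. It is hard, and involves recursively removing the innermost link, replacing it with a marker, and starting over. Have a look at the templateWithParams function in wikipedia.py if you want, but it is not pretty.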
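To give an idea of the "innermost first, then start over" approach, here is a rough sketch for templates. It is not what templateWithParams does: it simply deletes the innermost template instead of replacing it with a marker, and the names INNERMOST_TEMPLATE, strip_templates and max_passes are made up for illustration.

```python
import re

# A template with no braces inside it, i.e. an innermost one.
INNERMOST_TEMPLATE = re.compile(r"\{\{[^{}]*\}\}")

def strip_templates(text, max_passes=50):
    """Delete innermost {{...}} spans repeatedly until none are left."""
    for _ in range(max_passes):
        text, count = INNERMOST_TEMPLATE.subn("", text)
        if count == 0:
            # Nothing matched in this pass, so nothing is left to remove.
            break
    return text

print(repr(strip_templates("a {{outer|{{inner|x}}|y}} b")))
```

This will mangle {{{parameter}}} syntax and ignores braces inside nowiki or comments, so treat it as a starting point at best.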

Answer 1 (score: 1)

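Depending on what you need, there is a module called mwparserfromhell on GitHub that can get you very close to what you want. It has a method called strip_code() that strips a lot of the markup.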

import pywikibot
import mwparserfromhell

test_wikipedia = pywikibot.Site('en', 'test')
text = pywikibot.Page(test_wikipedia, 'Lestat_de_Lioncourt').get()

full = mwparserfromhell.parse(text)
stripped = full.strip_code()

print(full)
print('*******************')
print(stripped)

Compare the snippets:

{{db-foreign}}
<!--  Commented out because image was deleted: [[Image:lestat_tom_cruise.jpg|thumb|right|[[Tom Cruise]] as Lestat in the film ''[[Interview With The Vampire: The Vampire Chronicles]]''|{{deletable image-caption|1=Friday, 11 April 2008}}]] -->

[[Image:lestat.jpg|thumb|right|[[Stuart Townsend]] as Lestat in the film ''[[Queen of the Damned (film)|Queen of the Damned]]'']]

[[Image:Lestat IWTV.jpg|thumb|right|[[Tom Cruise]] as Lestat in the 1994 film ''[[Interview with the Vampire (film)|Interview with the Vampire]]'']]

'''Lestat de Lioncourt''' is a [[fictional character]] appearing in several [[novel]]s by [[Anne Rice]], including ''[[The Vampire Lestat]]''. He is a [[vampire]] and the main character in the majority of ''[[The Vampire Chronicles]]'', narrated in first person.   

==Publication history==
Lestat de Lioncourt is the narrator and main character of the majority of the novels in Anne Rice's ''The Vampire Chronicles'' series. ''[[The Vampire Lestat]]'', the second book in the series, is presented as Lestat's autobiography, and follows his exploits from his youth in France to his early years as a vampire. Many of the other books in the series are also credited as being written by Lestat. 


*******************

thumb|right|Stuart Townsend as Lestat in the film ''Queen of the Damned''

'''Lestat de Lioncourt''' is a fictional character appearing in several novels by Anne Rice, including ''The Vampire Lestat''. He is a vampire and the main character in the majority of ''The Vampire Chronicles'', narrated in first person.   

Publication history
Lestat de Lioncourt is the narrator and main character of the majority of the novels in Anne Rice's ''The Vampire Chronicles'' series. ''The Vampire Lestat'', the second book in the series, is presented as Lestat's autobiography, and follows his exploits from his youth in France to his early years as a vampire. Many of the other books in the series are also credited as being written by Lestat. 
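As the stripped output shows, an image caption can survive as a "thumb|right|..." line. One possible workaround is to cut the image links out of the raw wikitext before parsing it. The sketch below does that with a bracket-depth counter so that links nested in the caption are handled; strip_image_links and IMAGE_START are hypothetical names, and the prefix list (Image, File) is an assumption that ignores localized namespaces.

```python
import re

# Prefixes treated as image links here; an illustrative, incomplete list.
IMAGE_START = re.compile(r"\[\[(?:Image|File):", re.IGNORECASE)

def strip_image_links(text):
    """Delete [[Image:...]] / [[File:...]] spans, nested [[...]] included."""
    out = []
    pos = 0
    while True:
        match = IMAGE_START.search(text, pos)
        if match is None:
            out.append(text[pos:])
            break
        out.append(text[pos:match.start()])
        depth = 0
        i = match.start()
        # Walk forward, counting [[ and ]] until the opening pair is closed.
        while i < len(text):
            if text.startswith("[[", i):
                depth += 1
                i += 2
            elif text.startswith("]]", i):
                depth -= 1
                i += 2
                if depth == 0:
                    break
            else:
                i += 1
        # With unbalanced brackets, everything up to the end is dropped.
        pos = i
    return "".join(out)

sample = "[[Image:lestat.jpg|thumb|right|[[Stuart Townsend]] as Lestat]]\nBody [[vampire]]"
print(repr(strip_image_links(sample)))
```

The result can then be passed to mwparserfromhell.parse() and strip_code() as in the code above.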

Answer 2 (score: 0)

You can use wikitextparser. For example:

import pywikibot
import wikitextparser
en_wikipedia = pywikibot.Site('en', 'wikipedia')
text = pywikibot.Page(en_wikipedia,'Bla Bla Bla').get()
print(wikitextparser.parse(text).sections[0].plain_text())

which gives you:

"Bla Bla Bla" is a song written and recorded by Italian DJ Gigi D'Agostino. It heavily samples the vocals of "Why did you do it?" by British band Stretch. It was released in May 1999 as the third single from the album, L'Amour Toujours. It reached number 3 in Austria and number 15 in France. It was sampled in the song "Jump" from Lupe Fiasco's 2017 album Drogas Light.

Answer 3 (score: 0)

Pywikibot can remove any wikitext or HTML tags. There are two functions for this in textlib:

  1. removeHTMLParts(text: str, keeptags=['tt', 'nowiki', 'small', 'sup']) -> str:

    Returns the text with the HTML tags removed while keeping the text between them; tags listed in keeptags are left in place. For example:

     from pywikibot import textlib
     text = 'This is <small>small</small> text'
     print(textlib.removeHTMLParts(text, keeptags=[]))
    

    This will print:

     This is small text
    
  2. removeDisabledParts(text: str, tags=None, include=[], site=None) -> str:

    Returns the text without the portions where wiki markup is disabled. Unlike removeHTMLParts, this also removes the text inside the markup. For example:

     from pywikibot import textlib
     text = 'This is <small>small</small> text'
     print(textlib.removeDisabledParts(text, tags=['small']))
    

    This will print:

     This is  text
    

    There are many predefined tags that can be removed or kept, for example 'comment', 'header', 'link', 'template'.

    The default value of the tags parameter is ['comment', 'includeonly', 'nowiki', 'pre', 'syntaxhighlight'].

    Some more examples:

     textlib.removeDisabledParts('See [[this link]]', tags=['link']) gives 'See '
     textlib.removeDisabledParts('<!-- no comments -->', tags=['comment']) gives ''
     textlib.removeDisabledParts('{{Infobox}}', tags=['template']) gives '', but only with Pywikibot 6.0.0 or later
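If Pywikibot is not available, the "drop the tags, keep the text" behaviour described for removeHTMLParts can be roughly approximated with the standard library. This is only a sketch of the idea and not Pywikibot's implementation; strip_html_tags is a made-up name, and the default keeptags tuple merely mirrors the one listed above.

```python
import re

# An opening or closing tag; group 1 captures the tag name.
HTML_TAG = re.compile(r"</?([A-Za-z][A-Za-z0-9]*)\b[^>]*>")

def strip_html_tags(text, keeptags=("tt", "nowiki", "small", "sup")):
    """Remove HTML tags but keep the text between them.

    Tags whose name appears in keeptags are left in place.
    """
    keep = {name.lower() for name in keeptags}

    def replace(match):
        # Keep the whole tag if it is on the keep list, else drop it.
        return match.group(0) if match.group(1).lower() in keep else ""

    return HTML_TAG.sub(replace, text)

print(strip_html_tags("This is <small>small</small> text", keeptags=[]))
```

A regular expression like this does not understand comments, attribute values containing ">", or malformed markup, so it is no substitute for the textlib functions on real pages.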