How do I count the number of citations/references in raw Wikipedia text?

Time: 2018-08-19 01:52:00

Tags: regex nlp wikipedia wikipedia-api pywikibot

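I am building a model that classifies raw Wikipedia text by article quality (Wikipedia has a dataset of about 30,000 hand-graded articles with their corresponding quality grades). As part of this, I am trying to work out an algorithm that counts the number of citations appearing on a page.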

As a quick example, here is an excerpt from a raw wiki page:

'[[Image:GD-FR-Paris-Louvre-Sculptures034.JPG|320px|thumb|Tomb of Philippe Pot, governor of [[Burgundy (region)|Burgundy]] under [[Louis XI]]|alt=A large sculpture of six life-sized black-cloaked men, their faces obscured by their hoods, carrying a slab upon which lies the supine effigy of a knight, with hands folded together in prayer. His head rests on a pillow, and his feet on a small reclining lion.]]\n[[File:Sejong tomb 1.jpg|thumb|320px|Korean tomb mound of King [[Sejong the Great]], d. 1450]]\n[[Image:Istanbul - Süleymaniye camii - Türbe di Roxellana - Foto G. Dall\'Orto 28-5-2006.jpg|thumb|320px|[[Türbe]] of [[Roxelana]] (d. 1558), [[Süleymaniye Mosque]], [[Istanbul]]]]\n\'\'\'Funerary art\'\'\' is any work of [[art]] forming, or placed in, a repository for the remains of the [[death|dead]]. [[Tomb]] is a general term for the repository, while [[grave goods]] are objects—other than the primary human remains—which have been placed inside.<ref>Hammond, 58–9 characterizes [[Dismemberment|disarticulated]] human skeletal remains packed in body bags and incorporated into [[Formative stage|Pre-Classic]] [[Mesoamerica]]n [[mass burial]]s (along with a set of primary remains) at Cuello, [[Belize]] as "human grave goods".</ref>

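So far I have worked out that I can find the number of images by counting the occurrences of [[Image:. I was hoping to do something similar for references. In fact, after comparing raw wiki pages with their corresponding live pages, I think I have been able to determine that </ref> marks the end of a reference on the wiki page. For example, in the excerpt above the author makes a statement at the end of the paragraph and cites Hammond, 58–9 inside <ref> {text} </ref>.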
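A minimal sketch of the counting idea described above, using only the standard library. The sample string and the function names are made up for illustration. It assumes that images may use either the Image: or the File: prefix (both appear in the excerpt) and that a reference may also appear as a self-closing <ref ... /> tag, so it counts opening tags rather than </ref>.

```python
import re

# Hypothetical sample; any raw wikitext string can be used instead.
wikitext = (
    "[[Image:Example.jpg|thumb|A caption]]\n"
    "[[File:Other.png|thumb]]\n"
    "Some claim.<ref>Hammond, 58–9</ref> Another claim."
    '<ref name="smith">Smith, 12</ref> Reused.<ref name="smith" />'
)

# Embedded images start with [[Image: or [[File:.
IMAGE_RE = re.compile(r"\[\[\s*(?:Image|File)\s*:", re.IGNORECASE)
# Opening <ref> tags, with or without attributes, including self-closing
# ones. The \b keeps <references /> from being counted.
REF_RE = re.compile(r"<ref\b[^>]*>", re.IGNORECASE)


def count_images(text):
    """Count image embeds in raw wikitext."""
    return len(IMAGE_RE.findall(text))


def count_ref_tags(text):
    """Count <ref> usages (not unique sources) in raw wikitext."""
    return len(REF_RE.findall(text))


print(count_images(wikitext), count_ref_tags(wikitext))
```

Note that this counts every use of a reference, so a source cited twice through a named ref is counted twice.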

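If anyone is familiar with raw wiki data and can shed some light on this, please let me know! Also, if you know of a better approach, please tell me!

Thanks a lot!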


2 Answers:

Answer 0: (score: 1)

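  1. A ref does not always contain a link to a source. Sometimes it contains an explanatory note or similar.
  2. You have to count not only <ref>...</ref> but also footnote templates.
  3. If you need to count unique references, you have to exclude grouped references (references with a name="xxx" parameter, or automatically grouped footnote templates with the same content).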
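The three points above could be sketched roughly as follows. This is only an illustration under stated assumptions, not a complete wikitext parser: the function name and the sample text are hypothetical, only the {{sfn}} and {{sfnp}} templates are treated as automatically grouped footnotes, nested templates are not handled, and the group attribute of <ref> is ignored.

```python
import re

# A whole <ref> element: either self-closing or with a body up to </ref>.
REF_RE = re.compile(
    r"<ref\b(?P<attrs>[^>]*?)(?:/>|>(?P<body>.*?)</ref\s*>)",
    re.IGNORECASE | re.DOTALL,
)
# The name attribute, quoted or unquoted.
NAME_RE = re.compile(
    r"""\bname\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s/>]+))""", re.IGNORECASE
)
# Assumption: only these shortened-footnote templates are considered.
TEMPLATE_RE = re.compile(r"\{\{\s*(sfn|sfnp)\s*\|([^{}]*)\}\}", re.IGNORECASE)


def count_unique_references(wikitext):
    """Count references, collapsing named refs and identical sfn templates."""
    named = set()
    unnamed = 0
    for match in REF_RE.finditer(wikitext):
        name = NAME_RE.search(match.group("attrs"))
        if name:
            # All refs sharing a name render as one entry.
            named.add(next(g for g in name.groups() if g is not None))
        else:
            unnamed += 1
    templates = set()
    for match in TEMPLATE_RE.finditer(wikitext):
        # Identical template arguments are treated as one footnote.
        args = " ".join(match.group(2).split())
        templates.add((match.group(1).lower(), args))
    return len(named) + unnamed + len(templates)


# Hypothetical sample text.
sample = (
    "A.<ref>Hammond, 58–9</ref> "
    'B.<ref name="smith">Smith, 12</ref> '
    'C.<ref name="smith" /> '
    "D.{{sfn|Jones|2001|p=4}} E.{{sfn|Jones|2001|p=4}} F.{{sfn|Jones|2001|p=9}}"
)
print(count_unique_references(sample))
```

This still cannot tell a source citation from an explanatory note (point 1); that would need inspection of each ref body.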

Sorry for my English.

Answer 1: (score: 0)

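Counting the ref tags in the wiki markup is not necessarily accurate, because references can be reused: two </ref> tags may show up as only one reference in the list at the end of the page. There is an API that is supposed to provide an article's list of references, but for some reason it has been disabled. BeautifulSoup makes this quite simple, though. I have not tested this to check that it counts correctly for every article, but it works: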

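from bs4 import BeautifulSoup
import requests

# Fetch the rendered article HTML rather than the raw wikitext.
page = requests.get('https://en.wikipedia.org/wiki/Stack_Overflow')
soup = BeautifulSoup(page.content, 'html.parser')

# Each entry in the rendered reference list is wrapped in a
# <span class="reference-text">, so a reused reference is counted once.
count = len(soup.find_all('span', class_='reference-text'))

print(count)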