Question

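I am building a model to classify raw Wikipedia text by article quality (Wikipedia has a dataset of roughly 30,000 hand-graded articles and their corresponding quality grades). As part of that, I am trying to find a way to algorithmically count the number of citations that appear on a page.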

As a quick example, here is an excerpt from a raw Wiki page:

'[[Image:GD-FR-Paris-Louvre-Sculptures034.JPG|320px|thumb|Tomb of Philippe Pot, governor of [[Burgundy (region)|Burgundy]] under [[Louis XI]]|alt=A large sculpture of six life-sized black-cloaked men, their faces obscured by their hoods, carrying a slab upon which lies the supine effigy of a knight, with hands folded together in prayer. His head rests on a pillow, and his feet on a small reclining lion.]]\n[[File:Sejong tomb 1.jpg|thumb|320px|Korean tomb mound of King [[Sejong the Great]], d. 1450]]\n[[Image:Istanbul - Süleymaniye camii - Türbe di Roxellana - Foto G. Dall\'Orto 28-5-2006.jpg|thumb|320px|[[Türbe]] of [[Roxelana]] (d. 1558), [[Süleymaniye Mosque]], [[Istanbul]]]]\n\'\'\'Funerary art\'\'\' is any work of [[art]] forming, or placed in, a repository for the remains of the [[death|dead]]. [[Tomb]] is a general term for the repository, while [[grave goods]] are objects—other than the primary human remains—which have been placed inside.<ref>Hammond, 58–9 characterizes [[Dismemberment|disarticulated]] human skeletal remains packed in body bags and incorporated into [[Formative stage|Pre-Classic]] [[Mesoamerica]]n [[mass burial]]s (along with a set of primary remains) at Cuello, [[Belize]] as "human grave goods".</ref>

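So far, I have concluded that I can find the number of images by counting the occurrences of [[Image:. I was hoping I could do something similar for references. In fact, after comparing raw Wiki pages with their corresponding live pages, I think I was able to determine that </ref> is the closing notation for a reference on a Wiki page. For example, in the excerpt above the author makes a statement at the end of the paragraph and cites Hammond, 58–9 inside <ref> {text} </ref>.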
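The counting idea described above can be sketched as follows. This is only a minimal illustration, not a tested solution: the function name `count_markers` and the sample string are made up for the example, and it also counts [[File: links, since the excerpt uses both prefixes for images.

```python
import re

def count_markers(wikitext: str) -> dict:
    """Naive marker counts in raw wikitext; a rough signal, not an exact tally."""
    return {
        # Image links appear as [[Image:...]] or [[File:...]] in the excerpt.
        "images": len(re.findall(r"\[\[(?:Image|File):", wikitext, flags=re.IGNORECASE)),
        # Each </ref> closes one inline <ref>...</ref> footnote.
        "ref_closings": wikitext.count("</ref>"),
    }

# Hypothetical miniature page, shaped like the excerpt above.
sample = (
    "[[Image:A.jpg|thumb|caption]]\n"
    "[[File:B.jpg|thumb|caption]]\n"
    "Some claim.<ref>Hammond, 58–9</ref>"
)
print(count_markers(sample))  # {'images': 2, 'ref_closings': 1}
```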

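If anyone is familiar with Wiki's raw data and can shed some light on this, please let me know! Also, if you know of a better way to do this, please tell me!

Thanks a lot!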

Answer 1

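A ref does not always contain a link to a source. Sometimes it contains an explanatory note, etc.
You have to count not only <ref>...</ref>, but also footnote templates.
If you need to count unique references, you have to exclude grouped references (refs with a name="xxx" parameter, or footnote templates with identical content, which are grouped automatically).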
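The three points above could be combined into a rough sketch like the one below. It rests on several assumptions: the template list in `FOOTNOTE_TEMPLATES` is illustrative and not exhaustive, the function name `count_citations` is made up, regular expressions will not handle every edge case of wikitext (nested templates, comments, nowiki blocks), and explanatory notes are not told apart from real citations.

```python
import re

# Assumption: a small, non-exhaustive set of shortened-footnote templates.
FOOTNOTE_TEMPLATES = ("sfn", "sfnp", "sfnm")

REF_WITH_BODY = re.compile(r"<ref\b([^>/]*)>(.*?)</ref>", re.IGNORECASE | re.DOTALL)
REF_SELF_CLOSING = re.compile(r"<ref\b([^>]*?)/\s*>", re.IGNORECASE)
NAME_ATTR = re.compile(r"""\bname\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s/>]+))""", re.IGNORECASE)
TEMPLATE = re.compile(
    r"\{\{\s*(%s)\s*\|([^{}]*)\}\}" % "|".join(FOOTNOTE_TEMPLATES), re.IGNORECASE
)

def _ref_name(attrs):
    """Return the value of the name attribute, or None if it is absent."""
    m = NAME_ATTR.search(attrs)
    return next(g for g in m.groups() if g is not None) if m else None

def count_citations(wikitext):
    """Return (total footnote markers, unique footnotes) for raw wikitext."""
    keys = []
    for attrs, body in REF_WITH_BODY.findall(wikitext):
        name = _ref_name(attrs)
        # Named refs are grouped by name, unnamed ones by their normalized text.
        keys.append(("name", name) if name else ("text", " ".join(body.split())))
    for attrs in REF_SELF_CLOSING.findall(wikitext):
        name = _ref_name(attrs)
        if name:  # <ref name="x" /> reuses an earlier named ref
            keys.append(("name", name))
    for tpl, args in TEMPLATE.findall(wikitext):
        # Footnote templates with identical content count as one footnote.
        keys.append(("tpl", tpl.lower(), " ".join(args.split())))
    return len(keys), len(set(keys))

# Hypothetical sample: one named ref used twice, one unnamed ref, one sfn used twice.
sample = (
    'A.<ref name="hammond">Hammond, 58–9</ref> '
    'B.<ref name="hammond" /> '
    'C.<ref>Smith 2001, p. 4</ref> '
    'D.{{sfn|Jones|1999|p=12}} E.{{sfn|Jones|1999|p=12}}'
)
print(count_citations(sample))  # (5, 3)
```

The first number is what a plain tag count would give; the second removes the reuse that the name parameter and identical templates introduce.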

Sorry for my English.

Answer 2

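Counting ref tags in the wiki markup is not necessarily accurate, because a reference can be reused, so two </ref> tags may show up as only one reference in the list at the end of the page. There is an API that is supposed to provide the list of references for an article, but for some reason it has been disabled. BeautifulSoup makes this fairly simple, though. I have not tested this to check that it counts correctly for all articles, but it works: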

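from bs4 import BeautifulSoup
import requests

# Fetch the rendered article. Each entry in the reference list is a
# <span class="reference-text">, so a reused reference is counted once.
page = requests.get('https://en.wikipedia.org/wiki/Stack_Overflow')
page.raise_for_status()  # stop early if the request failed
soup = BeautifulSoup(page.content, 'html.parser')
count = len(soup.find_all('span', class_='reference-text'))

print(count)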

How do I count the number of citations/references in raw Wikipedia text?
