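I am building a model to classify raw Wikipedia text by article quality (Wikipedia has a dataset of roughly 30,000 hand-graded articles with their corresponding quality grades). With that said, I am trying to work out an algorithm for counting the number of citations that appear on a page.
As a quick example, here is an excerpt from a raw wiki page: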
'[[Image:GD-FR-Paris-Louvre-Sculptures034.JPG|320px|thumb|Tomb of Philippe Pot, governor of [[Burgundy (region)|Burgundy]] under [[Louis XI]]|alt=A large sculpture of six life-sized black-cloaked men, their faces obscured by their hoods, carrying a slab upon which lies the supine effigy of a knight, with hands folded together in prayer. His head rests on a pillow, and his feet on a small reclining lion.]]\n[[File:Sejong tomb 1.jpg|thumb|320px|Korean tomb mound of King [[Sejong the Great]], d. 1450]]\n[[Image:Istanbul - Süleymaniye camii - Türbe di Roxellana - Foto G. Dall\'Orto 28-5-2006.jpg|thumb|320px|[[Türbe]] of [[Roxelana]] (d. 1558), [[Süleymaniye Mosque]], [[Istanbul]]]]\n\'\'\'Funerary art\'\'\' is any work of [[art]] forming, or placed in, a repository for the remains of the [[death|dead]]. [[Tomb]] is a general term for the repository, while [[grave goods]] are objects—other than the primary human remains—which have been placed inside.<ref>Hammond, 58–9 characterizes [[Dismemberment|disarticulated]] human skeletal remains packed in body bags and incorporated into [[Formative stage|Pre-Classic]] [[Mesoamerica]]n [[mass burial]]s (along with a set of primary remains) at Cuello, [[Belize]] as "human grave goods".</ref>
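So far, I have worked out that I can find the number of images by counting the occurrences of [[Image:. I was hoping to do something similar for references. In fact, after comparing raw wiki pages with their corresponding live pages, I think I have determined that </ref> marks the end of a reference on a wiki page. For example, in the excerpt above the author makes a claim at the end of the paragraph and then cites Hammond, 58–9 inside <ref> {text} </ref>.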
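The counting idea described above could be sketched roughly as follows. This is only a minimal sketch, assuming the raw wikitext is already available as a Python string; the helper name count_markers and the sample text are hypothetical. Since the excerpt uses both [[Image: and [[File: for images, the sketch counts both prefixes.

```python
import re

# Hypothetical helper: counts image links and closing </ref> tags in raw wikitext.
# Both [[Image: and [[File: prefixes are counted, since the excerpt uses both.
IMAGE_RE = re.compile(r"\[\[\s*(?:Image|File)\s*:", re.IGNORECASE)
REF_CLOSE_RE = re.compile(r"</ref\s*>", re.IGNORECASE)


def count_markers(wikitext):
    """Return (image_count, closing_ref_count) for a raw wikitext string."""
    images = len(IMAGE_RE.findall(wikitext))
    refs = len(REF_CLOSE_RE.findall(wikitext))
    return images, refs


# Made-up sample text, loosely modelled on the excerpt above.
sample = (
    "[[Image:A.jpg|thumb|Caption]]\n"
    "[[File:B.jpg|thumb|Other]]\n"
    "Some claim.<ref>Hammond, 58–9</ref>"
)
print(count_markers(sample))  # (2, 1)
```

Note that this counts every closing </ref> tag, so it measures how many references are defined in the markup rather than how many entries appear in the rendered reference list.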
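If anyone is familiar with Wikipedia's raw data and can shed some light on this, please let me know! Also, if you know of a better approach, please tell me!
Thanks a lot!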
Answer 0 (score: 1)
In addition to <ref>...</ref>, you must also count footnote templates. Sorry for my English.
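As a rough illustration of counting footnote templates in raw wikitext, here is a minimal sketch. The template names in FOOTNOTE_TEMPLATES are an assumption and certainly not an exhaustive list, and the function name count_footnote_templates is hypothetical; adjust the set to whichever templates your articles actually use.

```python
import re

# Assumed, non-exhaustive set of template names that generate footnotes.
FOOTNOTE_TEMPLATES = ("sfn", "sfnp", "sfnm", "efn", "refn")


def count_footnote_templates(wikitext, names=FOOTNOTE_TEMPLATES):
    """Count template openings such as {{sfn|...}} for the given names."""
    alternation = "|".join(re.escape(n) for n in names)
    # Require "|" or "}}" right after the name so that an unrelated,
    # longer template name starting with the same letters is not counted.
    pattern = re.compile(
        r"\{\{\s*(?:" + alternation + r")\s*(?=\||\}\})",
        re.IGNORECASE,
    )
    return len(pattern.findall(wikitext))


# Made-up sample: two footnote templates and one unrelated template.
sample = (
    "Claim one.{{sfn|Hammond|1999|p=58}} "
    "Claim two.{{efn|A note.}} "
    "{{cite book|title=X}}"
)
print(count_footnote_templates(sample))  # 2
```

The result could then be added to the <ref> count, keeping in mind that some templates produce explanatory notes rather than citations.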
Answer 1 (score: 0)
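Counting reference tags in the wiki markup is not necessarily accurate, because a reference can be reused, so two </ref> tags may show up as only one reference in the list at the end of the page. There is an API that is supposed to provide this list, but for some reason it has been deactivated; BeautifulSoup, however, makes the process very simple. I have not tested this to check that it counts correctly for every article, but it works: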
from bs4 import BeautifulSoup
import requests

# Fetch the rendered HTML of the article
page = requests.get('https://en.wikipedia.org/wiki/Stack_Overflow')
soup = BeautifulSoup(page.content, 'html.parser')

# Each entry in the rendered reference list is a <span class="reference-text">
count = 0
for eachref in soup.find_all('span', attrs={'class': 'reference-text'}):
    count = count + 1
print(count)
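If only the raw wikitext is available, the reuse issue mentioned in this answer can be approximated by counting named references once. The following is a minimal sketch under stated assumptions: reused references carry a name attribute on the <ref> tag, and the function name count_distinct_refs is hypothetical. It does not handle references generated by templates, HTML comments, or <nowiki> sections.

```python
import re

# Matches the opening of any <ref> tag, including self-closing <ref name="x" />.
# Closing </ref> tags and <references /> are not matched.
REF_OPEN_RE = re.compile(r"<ref\b([^>]*)>", re.IGNORECASE)
# Extracts name="x", name='x', or name=x from the tag attributes.
NAME_RE = re.compile(
    r"""\bname\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s/>]+))""",
    re.IGNORECASE,
)


def count_distinct_refs(wikitext):
    """Count unnamed refs individually and each distinct ref name once."""
    names = set()
    unnamed = 0
    for match in REF_OPEN_RE.finditer(wikitext):
        attrs = match.group(1)
        name_match = NAME_RE.search(attrs)
        if name_match:
            # Exactly one of the three capture groups is set.
            name = next(g for g in name_match.groups() if g is not None)
            names.add(name.strip())
        elif not attrs.rstrip().endswith("/"):
            # Skip self-closing tags without a name; they carry no content.
            unnamed += 1
    return len(names) + unnamed


# Made-up sample: one named reference used twice, plus one unnamed reference.
sample = (
    'First.<ref name="hammond">Hammond, 58–9</ref> '
    'Second.<ref name="hammond" /> '
    'Third.<ref>Another source</ref>'
)
print(count_distinct_refs(sample))  # 2
```

Two unnamed references with identical text are still counted twice here, so the result may not match the rendered page exactly.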