How to identify Wikipedia categories in Python

Time: 2019-02-05 01:53:27

Tags: python mediawiki pywikibot

I am currently using pywikibot to get the categories of a given Wikipedia page (for example, support-vector machine), as shown below.

import pywikibot as pw

print([i.title() for i in list(pw.Page(pw.Site('en'), 'support-vector machine').categories())])

The result I get is:

[ 'Category:All articles with specifically marked weasel-worded phrases',
  'Category:All articles with unsourced statements',
  'Category:Articles with specifically marked weasel-worded phrases from May 2018',
  'Category:Articles with unsourced statements from June 2013',
  'Category:Articles with unsourced statements from March 2017',
  'Category:Articles with unsourced statements from March 2018',
  'Category:CS1 maint: Uses editors parameter',
  'Category:Classification algorithms',
  'Category:Statistical classification',
  'Category:Support vector machines',
  'Category:Wikipedia articles needing clarification from November 2017',
  'Category:Wikipedia articles with BNF identifiers',
  'Category:Wikipedia articles with GND identifiers',
  'Category:Wikipedia articles with LCCN identifiers' ]

As you can see, the result includes many of Wikipedia's tracking and maintenance categories, such as:

  • Category:All articles with specifically marked weasel-worded phrases
  • Category:All articles with unsourced statements
  • Category:CS1 maint: Uses editors parameter

However, the only categories I am interested in are:

  • Category:Classification algorithms
  • Category:Statistical classification
  • Category:Support vector machines

I would like to know whether there is a way to obtain all of Wikipedia's tracking and maintenance categories, so that I can remove them from the result and keep only the informative categories.

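Alternatively, if there is any other suitable way to eliminate them from the result, please suggest it.

I am happy to provide more details if needed.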

1 Answer:

Answer 0: (score: 2)

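pywikibot does not currently offer some of the API features for filtering out hidden categories. You can do this manually by checking for the hidden key in each category's categoryinfo: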

import pywikibot as pw

site = pw.Site('en', 'wikipedia')
# Keep only the categories whose categoryinfo has no 'hidden' key.
print([
    cat.title()
    for cat in pw.Page(site, 'support-vector machine').categories()
    if 'hidden' not in cat.categoryinfo
])

This gives:

['Category:Classification algorithms', 
 'Category:Statistical classification', 
 'Category:Support vector machines']
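If the same filter is needed for several pages, it could be wrapped in a small helper. The sketch below is only an illustration: visible_category_titles and FakeCategory are hypothetical names, not part of pywikibot, and FakeCategory merely stands in for a pywikibot Category object so that the snippet runs without network access.

```python
def visible_category_titles(categories):
    """Return the titles of categories whose categoryinfo has no 'hidden' key.

    `categories` is any iterable of objects exposing .title() and
    .categoryinfo, such as the result of pywikibot's Page.categories().
    """
    return [cat.title() for cat in categories if 'hidden' not in cat.categoryinfo]


class FakeCategory:
    """Hypothetical stand-in for a pywikibot Category, for offline use only."""

    def __init__(self, title, categoryinfo):
        self._title = title
        self.categoryinfo = categoryinfo

    def title(self):
        return self._title


# Hand-written sample data, not real API output.
sample = [
    FakeCategory('Category:All articles with unsourced statements', {'hidden': ''}),
    FakeCategory('Category:Statistical classification', {}),
]
print(visible_category_titles(sample))
```

With pywikibot itself, the call would presumably look like visible_category_titles(pw.Page(site, 'support-vector machine').categories()). Note that reading categoryinfo may trigger an additional API request per category.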

For more information, see https://www.mediawiki.org/wiki/Help:Categories#Hidden_categories and https://en.wikipedia.org/wiki/Wikipedia:Categorization#Hiding_categories
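As an alternative that bypasses pywikibot, the MediaWiki API's prop=categories query accepts clshow=!hidden, which asks the server to leave hidden categories out. The following is a minimal sketch using only the standard library. The function names and the User-Agent string are made up for this example, and continuation of long result lists is not handled.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"


def build_category_query(title):
    """Build query parameters requesting only non-hidden categories of a page."""
    return {
        "action": "query",
        "format": "json",
        "prop": "categories",
        "titles": title,
        "clshow": "!hidden",  # exclude hidden (tracking/maintenance) categories
        "cllimit": "max",
    }


def parse_category_titles(response):
    """Extract category titles from a decoded JSON query response."""
    pages = response.get("query", {}).get("pages", {})
    return [
        cat["title"]
        for page in pages.values()
        for cat in page.get("categories", [])
    ]


def fetch_visible_categories(title):
    """Query the API over the network (illustrative; no continuation handling)."""
    url = API_URL + "?" + urllib.parse.urlencode(build_category_query(title))
    # The User-Agent value is a placeholder chosen for this example.
    request = urllib.request.Request(url, headers={"User-Agent": "category-example/0.1"})
    with urllib.request.urlopen(request) as resp:
        return parse_category_titles(json.load(resp))
```

Calling fetch_visible_categories('Support-vector machine') would then perform a single request instead of one categoryinfo lookup per category.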