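I'm trying to find out how to get just a complete list of all Wikipedia page titles. I found similar questions, but they all suggest using "dump" files, which I don't know how to handle.
I only need the titles.
Thanks in advance for your support.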
Answer 0 (score: 1)
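As suggested in the comments, you should use the Wikipedia API, specifically Allpages.
To get "all" Wikipedia titles from a to z (I'm not sure that is feasible; check the apnamespace API argument), here is a quick threaded script for the problem: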
import threading, requests, string

all_titles = {}  # will hold the final results (pageid -> title)

def parse_letter(l):
    # note: this fetches only the first batch of titles starting at letter l
    try:
        j_obj = requests.get(f"https://en.wikipedia.org/w/api.php?action=query&list=allpages&aplimit=1000&apfrom={l}&format=json").json()
        for p in j_obj['query']['allpages']:
            all_titles[p['pageid']] = p['title']  # add to the final dictionary
            print(p['pageid'], p['title'])
    except Exception as e:
        # report network, JSON, or missing-key errors instead of hiding them
        print(f"Error letter {l}", e)

# loop over all letters from a to z and start one thread per letter
threads = []
for l in string.ascii_lowercase:  # abcdefghijklmnopqrstuvwxyz
    t = threading.Thread(target=parse_letter, args=[l])
    t.start()
    threads.append(t)

# wait for the threads to finish
for t in threads:
    t.join()

from pprint import pprint
pprint(all_titles)

'''
To export a json file, use:
import json
with open("all_titles.json", "w") as f:
    f.write(json.dumps(all_titles))
'''
Output (pageid: title):
{290: 'A',
4666: 'B*-algebra',
27084: "B'Elanna Torres",
76365: 'B-17',
77818: "B'nai Noach",
92281: "B'alam Quitzé",
92282: "B'alam Quitze",
92283: "B'alam Agab",
...
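Notes:
aplimit=1000 can be changed to a higher value (untested), although the API documents a per-request maximum of 500 for regular clients (5000 for clients with higher limits), so larger values are likely capped. Redirects can be excluded with gapfilterredir=nonredirects; with list=allpages the parameter is spelled apfilterredir (the g prefix applies when allpages is used as a generator).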
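The script above reads only one batch of titles per starting letter, so it does not give the complete list. To go further, the API's continuation tokens have to be followed. Below is a minimal sketch of that, not a tested production crawler. It assumes the standard `continue` object in the response and uses only the standard library instead of requests. The names `fetch_json`, `iter_titles`, `max_batches` and the `USER_AGENT` string are hypothetical and introduced here for illustration.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"
# Hypothetical User-Agent string; replace it with your own contact details.
USER_AGENT = "all-titles-example/0.1 (contact: you@example.org)"


def fetch_json(params):
    """Send one GET request to the API and return the decoded JSON."""
    url = API_URL + "?" + urllib.parse.urlencode(params)
    req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def iter_titles(fetch=fetch_json, namespace=0, max_batches=None):
    """Yield (pageid, title) pairs, following continuation tokens."""
    params = {
        "action": "query",
        "list": "allpages",
        "aplimit": "max",                 # let the server pick its maximum
        "apnamespace": namespace,         # 0 is the main (article) namespace
        "apfilterredir": "nonredirects",  # skip redirect pages
        "format": "json",
    }
    batches = 0
    while max_batches is None or batches < max_batches:
        data = fetch(dict(params))  # pass a copy so callers cannot mutate params
        for page in data["query"]["allpages"]:
            yield page["pageid"], page["title"]
        batches += 1
        if "continue" not in data:
            break  # no more results
        # merge the continuation values (e.g. apcontinue) into the next request
        params.update(data["continue"])
```

For a quick try, `for pageid, title in iter_titles(max_batches=2): print(pageid, title)` prints the first two batches. Leaving `max_batches` at `None` walks the whole namespace, which on English Wikipedia means many thousands of sequential requests, so expect it to run for a long time.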