Question

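I am trying to find out how to get a complete list of just the titles of all Wikipedia pages. I found similar questions, but all of them suggest using "dump" files, which I don't know how to handle.

I only need the titles.

Thanks in advance for your support.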

Answer 1

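As suggested in the comments, you should use the Wikipedia API, specifically Allpages.
Here is a quick threaded script that requests titles starting at each letter from a to z. I am not sure it is feasible to get literally "all" titles this way; check the apnamespace argument in the API docs.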

import string
import threading
from pprint import pprint

import requests

API_URL = "https://en.wikipedia.org/w/api.php"
all_titles = {}  # pageid -> title; will hold the final results

def parse_letter(letter):
    # Fetch one batch of titles starting at `letter`. The API caps aplimit at
    # 500 for ordinary clients, so this is only the first batch per letter.
    params = {"action": "query", "list": "allpages", "aplimit": 500,
              "apfrom": letter, "format": "json"}
    try:
        j_obj = requests.get(API_URL, params=params, timeout=30).json()
        for p in j_obj["query"]["allpages"]:
            all_titles[p["pageid"]] = p["title"]  # add to the final dictionary
            print(p["pageid"], p["title"])
    except (requests.RequestException, KeyError, ValueError) as e:
        print(f"Error letter {letter}", e)

# start one thread per letter from a to z
threads = [threading.Thread(target=parse_letter, args=(l,))
           for l in string.ascii_lowercase]  # abcdefghijklmnopqrstuvwxyz
for t in threads:
    t.start()

# wait for the threads to finish
for t in threads:
    t.join()

pprint(all_titles)

'''
To export a JSON file, use:
import json
with open("all_titles.json", "w") as f:
    json.dump(all_titles, f)
'''

Output (pageid: title):

{290: 'A',
 4666: 'B*-algebra',
 27084: "B'Elanna Torres",
 76365: 'B-17',
 77818: "B'nai Noach",
 92281: "B'alam Quitzé",
 92282: "B'alam Quitze",
 92283: "B'alam Agab",
...
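
The listing above holds only the first batch returned for each starting letter. To walk through every title, the API's continuation values have to be passed back with each follow-up request. The following is a minimal sketch of that loop, not a tested crawler: the names iter_titles, fetch_json and USER_AGENT are made up for illustration, and it uses the standard library's urllib instead of requests only to stay dependency-free.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"
# Placeholder User-Agent (an assumption): replace it with your own details.
USER_AGENT = "all-titles-sketch/0.1 (contact: you@example.com)"


def fetch_json(params):
    """Send one GET request to the API and return the decoded JSON."""
    url = API_URL + "?" + urllib.parse.urlencode(params)
    req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def iter_titles(fetch=fetch_json, namespace=0, skip_redirects=True, batch=500):
    """Yield (pageid, title) pairs, following the API's continuation values."""
    params = {
        "action": "query",
        "list": "allpages",
        "aplimit": batch,
        "apnamespace": namespace,
        "format": "json",
    }
    if skip_redirects:
        params["apfilterredir"] = "nonredirects"
    cont = {}  # continuation values from the previous response, if any
    while True:
        data = fetch({**params, **cont})
        for page in data["query"]["allpages"]:
            yield page["pageid"], page["title"]
        if "continue" not in data:
            break  # no more batches
        cont = data["continue"]
```

Iterating over iter_titles() with the defaults would issue one request per 500 titles, which for English Wikipedia means many thousands of sequential requests, so it is worth adding a pause between calls and writing results to disk as you go.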

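Notes:

The API caps aplimit: 500 per request for ordinary clients, 5000 for accounts with higher limits such as bots. Setting it above the cap does not return more results.
To filter out all redirect pages, use apfilterredir=nonredirects (the gap-prefixed form, gapfilterredir, applies only when Allpages is used as a generator).
Read the Wikipedia API's Allpages documentation.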
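
If the goal really is the complete list, the "dump" route mentioned in the question may be less work than it sounds. As far as I know, the Wikimedia dumps site publishes a titles-only file for the main namespace (named enwiki-latest-all-titles-in-ns0.gz at the time of writing; check the dump index for the current name). The sketch below assumes that file has already been downloaded, that it contains one title per line, and that it starts with a page_title header line. The function name read_dump_titles is made up for illustration.

```python
import gzip


def read_dump_titles(path):
    """Yield page titles from a gzip-compressed titles-only dump file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line_no, line in enumerate(f):
            title = line.rstrip("\n")
            # Assumption: the first line is a 'page_title' header, not a title.
            if line_no == 0 and title == "page_title":
                continue
            if title:
                # Dump titles use underscores where the site shows spaces.
                yield title.replace("_", " ")
```

Because the function is a generator, the titles can be streamed into a file or a database without holding the whole list in memory. Unlike the API output, this file carries titles only, with no page IDs.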
