Interested in searching the Wikipedia xml dump for medical-related terms only

Time: 2015-03-30 22:44:59

Tags: python regex xml perl wikipedia

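I want to define medical terms automatically. However, the standard medical dictionaries and WordNet are not sufficient, so I downloaded the Wikipedia corpus to use instead. But when I downloaded enwiki-latest-pages-articles.xml (which, by the way, begins with the word "Anarchism"; why not something like "AA"?), I immediately failed to grep it because of the size of the file, and started looking online. I found what I believe are libraries already written for this, such as Perl's MediaWiki::DumpFile (I know some Perl, but I would prefer Python because that is what my scripts are written in), but it looks like most of them create or require some sort of database. I just want to match a word (fuzzily, if possible) and grab the first few sentences of its introductory paragraph; for example, searching for 'salmonella' would return:

Salmonella /ˌsælməˈnɛlə/ is a genus of rod-shaped (bacillus) bacteria of the Enterobacteriaceae family. There are only two species of Salmonella, Salmonella bongori and Salmonella enterica, of which there are around six subspecies and innumerable serovars. The genus Escherichia, which includes the species E.coli belongs to the same family. Salmonellae are found worldwide in both cold-blooded and warm-blooded animals, and in the environment. They cause illnesses such as typhoid fever, paratyphoid fever, and food poisoning.[1]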

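For my purposes (just using it as a kind of glossary), are these scripts what I want? (I find the documentation hard to follow without examples.) For example, I would like to: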

  1. Just to cut down the material to search, remove everything that is not medicine-related. (I tried using the category filters, since Wikipedia allows exporting specific categories, but they did not work the way I wanted; for example, 'Medicine' returned only about 20 pages, so I would rather process the xml file somehow.)

  2. Allow my Python script to search the Wikipedia corpus quickly. For example, if I want to match CHOLERAE, I would like it to take me to the definition of Vibrio cholerae, just as the actual Wikipedia search function does (it simply takes me to the best choice). I wrote a sort of search engine that can do this, but it would be slow on such a large file (40 GB).

Apologies in advance for what is probably a very naive question.

1 answer:

Answer 0: (score: 2)

Here is one way to query the Wikipedia database without downloading the whole thing.

import requests
import argparse

parser = argparse.ArgumentParser(description='Fetch wikipedia extracts.')
parser.add_argument('word', help='word to define')
args = parser.parse_args()

proxies = {
    # See http://www.mediawiki.org/wiki/API:Main_page#API_etiquette
    # "http": "http://localhost:3128",
}

headers = {
    # http://www.mediawiki.org/wiki/API:Main_page#Identifying_your_client
    "User-Agent": "Definitions/1.0 (Contact rob@example.com for info.)"
}

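params = {
    'action': 'query',
    'prop': 'extracts',       # return text extracts of the matched page
    'format': 'json',
    'exintro': 1,             # only the text before the first section heading
    'explaintext': 1,         # plain text instead of HTML
    'generator': 'search',    # feed search results into the extracts query
    'gsrsearch': args.word,
    'gsrlimit': 1,            # keep only the top search hit
    'continue': ''
}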

r = requests.get('https://en.wikipedia.org/w/api.php',
                 params=params,
                 headers=headers,
                 proxies=proxies)
data = r.json()
if "query" in data:
    # gsrlimit=1 means at most one page comes back; take that single entry.
    page = next(iter(data["query"]["pages"].values()))
    print(page["extract"])
else:
    print("No definition.")

Here are some results. Note that it still returns a result even when the word is misspelled.

$ python define.py CHOLERAE
Vibrio cholerae is a Gram-negative, comma-shaped bacterium. Some strains of V. cholerae cause the disease cholera. V. cholerae is a facultative anaerobic organism and has a flagellum at one cell pole. V. cholerae was first isolated as the cause of cholera by Italian anatomist Filippo Pacini in 1854, but his discovery was not widely known until Robert Koch, working independently 30 years later, publicized the knowledge and the means of fighting the disease.
$ python define.py salmonella
Salmonella /ˌsælməˈnɛlə/ is a genus of rod-shaped (bacillus) bacteria of the Enterobacteriaceae family. There are only two species of Salmonella, Salmonella bongori and Salmonella enterica, of which there are around six subspecies and innumerable serovars. The genus Escherichia, which includes the species E.coli belongs to the same family.
Salmonellae are found worldwide in both cold-blooded and warm-blooded animals, and in the environment. They cause illnesses such as typhoid fever, paratyphoid fever, and food poisoning.
$ python define.py salmanela
Salmonella /ˌsælməˈnɛlə/ is a genus of rod-shaped (bacillus) bacteria of the Enterobacteriaceae family. There are only two species of Salmonella, Salmonella bongori and Salmonella enterica, of which there are around six subspecies and innumerable serovars. The genus Escherichia, which includes the species E.coli belongs to the same family.
Salmonellae are found worldwide in both cold-blooded and warm-blooded animals, and in the environment. They cause illnesses such as typhoid fever, paratyphoid fever, and food poisoning.
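
If you would still rather work from the downloaded dump, as the question describes, the file can be streamed instead of loaded or grepped whole. The following is only a minimal sketch under stated assumptions: the `MEDICAL_HINTS` keyword pattern and the function names are hypothetical, and treating `[[Category:...]]` links in the wikitext as a sign of "medical" content is a rough heuristic, not something Wikipedia guarantees. It relies only on the standard-library `xml.etree.ElementTree.iterparse`.

```python
import io
import re
import xml.etree.ElementTree as ET

# Hypothetical keyword list; tune it to your own notion of "medical".
MEDICAL_HINTS = re.compile(
    r"\[\[Category:[^\]]*(medic|disease|bacteri|patholog)", re.I)


def local_name(tag):
    """Strip the '{namespace}' prefix that the export schema adds to tags."""
    return tag.rsplit("}", 1)[-1]


def iter_medical_pages(source):
    """Yield (title, wikitext) for pages whose categories look medical.

    `source` is a filename or a binary file object. Pages are handled one
    at a time and discarded, so the whole dump is never held in memory.
    """
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    title, text = None, ""
    for event, elem in context:
        if event != "end":
            continue
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            if title and MEDICAL_HINTS.search(text):
                yield title, text
            title, text = None, ""
            root.clear()  # drop the finished page from the tree


if __name__ == "__main__":
    # Tiny in-memory stand-in for the real dump, for illustration only.
    demo = io.BytesIO(
        b"<mediawiki><page><title>Salmonella</title><revision>"
        b"<text>A genus. [[Category:Bacterial diseases]]</text>"
        b"</revision></page></mediawiki>")
    for page_title, _wikitext in iter_medical_pages(demo):
        print(page_title)
```

For the real file you would pass the dump's path, or a file object from `bz2.open(path, "rb")` if you kept it compressed. The yielded wikitext still contains markup, so extracting clean introductory sentences would need a further step that this sketch does not attempt.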