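Correcting a URL to its proper form

Time: 2016-06-02 01:56:26

Tags: python json regex python-2.7

I wrote a simple script that queries a JSON API, using keywords to build each URL.

Here is the script I wrote: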

import urllib2
import json

f1 = open('CatList.text', 'r')
f2 = open('SubList.text', 'w')
lines = f1.read().splitlines()


for  line in lines:

    url ='https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle='+line+'&cmlimit=100'
    json_obj = urllib2.urlopen(url)
    data = json.load(json_obj)
    for item in data['query']:
            for i in data['query']['categorymembers']:
                print i['title']
                print '-----------------------------------------'
                f2.write((i['title']).encode('utf8')+"\n")

In this script, the program first reads CatList.text, which supplies the list of keywords used in the URLs.
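As a side note on the inner part of the script: the outer loop over data['query'] repeats the inner loop once per key in 'query', so a single loop over data['query']['categorymembers'] is enough. Below is a minimal sketch in Python 3 (not the Python 2.7 of the original script). The helper name extract_titles and the sample dictionary are made up for illustration; the sample only mimics the shape of a categorymembers reply and is not real API output.

```python
def extract_titles(data):
    """Return the titles listed under data['query']['categorymembers']."""
    # .get() with defaults avoids a KeyError when the reply has no 'query' part
    members = data.get('query', {}).get('categorymembers', [])
    return [member['title'] for member in members]


# Hypothetical sample shaped like a categorymembers reply (values are invented)
sample = {
    'query': {
        'categorymembers': [
            {'pageid': 1, 'ns': 0, 'title': 'Human geography'},
            {'pageid': 2, 'ns': 0, 'title': 'Physical geography'},
        ]
    }
}

for title in extract_titles(sample):
    print(title)
```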

Below is a sample of what CatList.text contains.

Category:Branches of geography
Category:Geography by place
Category:Geography awards and competitions
Category:Geography conferences
Category:Geography education
Category:Environmental studies
Category:Exploration
Category:Geocodes
Category:Geographers
Category:Geographical zones
Category:Geopolitical corridors
Category:History of geography
Category:Land systems
Category:Landscape
Category:Geography-related lists
Category:Lists of countries by geography
Category:Navigation
Category:Geography organizations
Category:Places
Category:Geographical regions
Category:Surveying
Category:Geographical technology
Category:Geography terminology
Category:Works about geography
Category:Geographic images
Category:Geography stubs

My program takes each keyword and puts it into the URL.

However, I cannot get any results. I checked the code by printing the URLs:

import urllib2
import json

f1 = open('CatList.text', 'r')
f2 = open('SubList2.text', 'w')
lines = f1.read().splitlines()


for  line in lines:

    url ='https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle='+line+'&cmlimit=100'
    json_obj = urllib2.urlopen(url)
    data = json.load(json_obj)


    f2.write(url+'\n')

The result I got in SubList2.text is as follows:

https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Branches of geography&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Geography by place&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Geography awards and competitions&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Geography conferences&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Geography education&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Environmental studies&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Exploration&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Geocodes&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Geographers&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Geographical zones&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Geopolitical corridors&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:History of geography&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Land systems&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Landscape&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Geography-related lists&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Lists of countries by geography&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Navigation&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Geography organizations&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Places&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Geographical regions&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Surveying&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Geographical technology&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Geography terminology&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Works about geography&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Geographic images&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Geography stubs&cmlimit=100

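This shows that the keywords are being placed into the URLs correctly.

But when I run the full code, it does not return the correct results.

One thing I noticed is that when I paste a link into the browser's address bar, for example: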

https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Branches of geography&cmlimit=100

it returns the correct result, because the address bar automatically corrects it to:

https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Branches%20of%20geography&cmlimit=100

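I believe that if the spaces in "Category:Branches of geography" were replaced with %20, my script would be able to fetch the correct JSON items.

Question: I don't know how to modify the statement in the code above so that the spaces in the CatList entries are replaced with %20.

Please forgive my formatting mistakes and the long post; I am still learning Python.

Thank you for your help.

Edit

Thanks, Tim. Your solution works: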

 url ='https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle='+urllib2.quote(line)+'&cmlimit=100'

It now prints the correct URLs:

https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3ABranches%20of%20geography&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AGeography%20by%20place&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AGeography%20awards%20and%20competitions&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AGeography%20conferences&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AGeography%20education&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AEnvironmental%20studies&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AExploration&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AGeocodes&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AGeographers&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AGeographical%20zones&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AGeopolitical%20corridors&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AHistory%20of%20geography&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3ALand%20systems&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3ALandscape&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AGeography-related%20lists&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3ALists%20of%20countries%20by%20geography&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3ANavigation&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AGeography%20organizations&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3APlaces&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AGeographical%20regions&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3ASurveying&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AGeographical%20technology&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AGeography%20terminology&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AWorks%20about%20geography&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AGeographic%20images&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AGeography%20stubs&cmlimit=100
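Note that these URLs contain Category%3A where the browser's address bar kept the literal colon. Both are percent-encodings of the same title: quote() escapes ':' unless told otherwise. The sketch below uses Python 3's urllib.parse.quote (an assumption, since the script above is Python 2.7) to show the default behaviour and the effect of the safe parameter.

```python
from urllib.parse import quote

title = 'Category:Branches of geography'

# Default: ':' and spaces are both percent-encoded
default_encoded = quote(title)
print(default_encoded)   # Category%3ABranches%20of%20geography

# safe=':' leaves the colon as-is, matching what the address bar produced
colon_kept = quote(title, safe=':')
print(colon_kept)        # Category:Branches%20of%20geography
```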

1 answer:

Answer 0 (score: 1)

Use urllib.quote() to percent-encode the special characters in the URL:

Python 2:

import urllib
line = 'Category:Branches of geography'
url ='https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=' + urllib.quote(line) + '&cmlimit=100'

https://docs.python.org/2/library/urllib.html#urllib.quote

Python 3:

import urllib.parse
line = 'Category:Branches of geography'
url ='https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=' + urllib.parse.quote(line) + '&cmlimit=100'

https://docs.python.org/3.5/library/urllib.parse.html#urllib.parse.quote
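As an alternative to concatenating strings, the whole query string can be built from parameter pairs so that every value is encoded, not only the title. This is a minimal sketch for Python 3.5 or later, not part of the original answer; the names API_URL and build_url are hypothetical and chosen for illustration.

```python
from urllib.parse import quote, urlencode

API_URL = 'https://en.wikipedia.org/w/api.php'


def build_url(category, limit=100):
    """Build a categorymembers query URL with every parameter value encoded."""
    params = [
        ('action', 'query'),
        ('format', 'json'),
        ('list', 'categorymembers'),
        ('cmtitle', category),
        ('cmlimit', limit),
    ]
    # quote_via=quote encodes spaces as %20; the default (quote_plus) would use '+'
    return API_URL + '?' + urlencode(params, quote_via=quote)


print(build_url('Category:Branches of geography'))
```

With the default quote_via, the spaces would come out as '+' instead of %20, which is also a valid encoding inside a query string.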