Get a list of keywords from JSON

Time: 2016-06-01 07:33:50

Tags: python json python-2.7 urllib2

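I ran into a problem, and I don't understand why the output is printed this way.

Here is my code; please excuse the poor formatting, as I'm new to programming. It opens a text file containing a large number of keywords: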

import urllib2
import json

f1 = open('CatList.text')
lines = f1.readlines()

for  line in lines:

    url ='https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle='+line+'&cmlimit=100'

    print(url)

    json_obj = urllib2.urlopen(url)
    data = json.load(json_obj)

    #to write the result
    f2 = open('SubList.text', 'w')

    f2.write(url)

    for item in data['query']:

            for i in data['query']['categorymembers']:


                f2.write((i['title']).encode('utf8')+"\n")

I get this error:

Traceback (most recent call last):
  File "Test2.py", line 16, in <module>
    json_obj = urllib2.urlopen(url)
  File "/usr/lib/python2.7/urllib2.py", line 127, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 402, in open
    req = meth(req)
  File "/usr/lib/python2.7/urllib2.py", line 1113, in do_request_
    raise URLError('no host given')
urllib2.URLError: <urlopen error no host given>

I'm not sure what this error means, so I tried printing the URL:

import urllib2
import json

f1 = open('CatList.text')
f2 = open('SubList.text', 'w')
lines = f1.readlines()

for  line in lines:

    url ='https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle='+line+'&cmlimit=100'

    print(url)
    f2.write(url+'\n')

The result I got is strange (part of the output is shown below):

https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Branches of geography
&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Geography by place
&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Geography awards and competitions
&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Geography conferences
&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Geography education
&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Environmental studies
&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Exploration
&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Geocodes
&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Geographers
&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Geographical zones
&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Geopolitical corridors
&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:History of geography
&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Land systems
&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Landscape
&cmlimit=100
https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Geography-related lists
&cmlimit=100

Note that each URL is split across two lines:

https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Geography-related lists
&cmlimit=100 

instead of

https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Geography-related lists&cmlimit=100

My first question is: how do I fix this?

Second, is this what is causing the error?

My CatList.text looks like this:

Category:Branches of geography
Category:Geography by place
Category:Geography awards and competitions
Category:Geography conferences
Category:Geography education
Category:Environmental studies
Category:Exploration
Category:Geocodes
Category:Geographers
Category:Geographical zones
Category:Geopolitical corridors
Category:History of geography
Category:Land systems
Category:Landscape
Category:Geography-related lists
Category:Lists of countries by geography
Category:Navigation
Category:Geography organizations
Category:Places
Category:Geographical regions
Category:Surveying
Category:Geographical technology
Category:Geography terminology
Category:Works about geography
Category:Geographic images
Category:Geography stubs

Sorry for the long post. I really appreciate your help. Thank you.

2 answers:

Answer 0 (score: 2)

Friend, '\n' is generally used for a line break. In the same way, a file has a hidden '\n' character between each pair of lines.

So lines = f1.readlines() keeps the '\n' at the end of each line. That is the problem.

To avoid this, read the file with f1.read().splitlines() instead.
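
A minimal sketch of the difference. To keep it self-contained, an in-memory io.StringIO object stands in for open('CatList.text'); that stand-in and the two sample lines are assumptions for illustration only.

```python
import io

# Stand-in for the contents of CatList.text (two lines taken from the question).
sample = u"Category:Exploration\nCategory:Geocodes\n"

# readlines() keeps the trailing newline on every line.
with_newlines = io.StringIO(sample).readlines()

# read().splitlines() splits on line boundaries and drops the newline characters.
without_newlines = io.StringIO(sample).read().splitlines()

print(with_newlines)
print(without_newlines)
```

With a real file, the same two calls apply to the object returned by open().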

Answer 1 (score: 1)

Change the following line:

url ='https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle='+line+'&cmlimit=100'

to:

url ='https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle='+line.strip()+'&cmlimit=100'

Your line contains a newline character (\n). Calling .strip() removes it, because strip() removes whitespace from both ends of the string.
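
As a rough sketch of that fix in isolation, the URL construction can be wrapped in a small helper. The names BASE and build_url are hypothetical and not part of the original code; only the URL pieces and the strip() call come from the question and this answer.

```python
# Fixed part of the query URL, copied from the question.
BASE = ('https://en.wikipedia.org/w/api.php?action=query&format=json'
        '&list=categorymembers&cmtitle=')


def build_url(line):
    """Build the request URL for one line read from the category file."""
    # strip() removes leading and trailing whitespace, including the '\n'
    # that readlines() leaves at the end of each line.
    return BASE + line.strip() + '&cmlimit=100'


url = build_url('Category:Geocodes\n')
print(url)
```

If only the trailing newline should go and other surrounding whitespace must be kept, line.rstrip('\n') is a narrower alternative.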