import urllib
import StringIO
from lxml import etree

pgno = 1
while pgno < 4304:
    result = urllib.urlopen("http://www.example.com/traderesources/pincode.aspx?" +
                            "&GridInfo=Pincode0" + pgno)
    print pgno
    html = result.read()
    parser = etree.HTMLParser()
    tree = etree.parse(StringIO.StringIO(html), parser)
    pgno += 1
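In http://.......=Pincode0
I need to append a number, for example to get 'Pincode01', and then loop from 01 to 02, 03 and so on. I am using a while loop, and the counter variable is 'pgno'.
The problem is that the counter increases by 1, but 'Pincode01' does not become 'Pincode02', so the second page of the site is never opened.
I even tried +str(pgno))
... no luck.
Please explain how to do this. I have tried many times and cannot get it to work.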
Answer 0 (score: 1)
Maybe you want this:
from urllib import urlopen
import re

pgno = 2
url = "http://www.eximguru.com/traderesources/pincode.aspx?&GridInfo=Pincode0%s" % str(pgno)
print url + '\n'

# Download the page
sock = urlopen(url)
htmlcode = sock.read()
sock.close()

# Offset of the first __doPostBack link; the row search starts from here
x = re.search('%;"><a href="javascript:__doPostBack', htmlcode).start()
# Each row has four cells: the pincode followed by three text columns
pat = ('\t\t\t\t<td style="width:\d+%;">(\d+)</td>'
       '<td style="width:\d+%;">(.+?)</td>'
       '<td style="width:\d+%;">(.+?)</td>'
       '<td style="width:30%;">(.+?)</td>\r\n')
regx = re.compile(pat)
print '\n'.join(map(repr, regx.findall(htmlcode, x)))
Result
http://www.eximguru.com/traderesources/pincode.aspx?&GridInfo=Pincode02
('110001', 'New Delhi', 'Delhi', 'Baroda House')
('110001', 'New Delhi', 'Delhi', 'Bengali Market')
('110001', 'New Delhi', 'Delhi', 'Bhagat Singh Market')
('110001', 'New Delhi', 'Delhi', 'Connaught Place')
('110001', 'New Delhi', 'Delhi', 'Constitution House')
('110001', 'New Delhi', 'Delhi', 'Election Commission')
('110001', 'New Delhi', 'Delhi', 'Janpath')
('110001', 'New Delhi', 'Delhi', 'Krishi Bhawan')
('110001', 'New Delhi', 'Delhi', 'Lady Harding Medical College')
('110001', 'New Delhi', 'Delhi', 'New Delhi Gpo')
('110001', 'New Delhi', 'Delhi', 'New Delhi Ho')
('110001', 'New Delhi', 'Delhi', 'North Avenue')
('110001', 'New Delhi', 'Delhi', 'Parliament House')
('110001', 'New Delhi', 'Delhi', 'Patiala House')
('110001', 'New Delhi', 'Delhi', 'Pragati Maidan')
('110001', 'New Delhi', 'Delhi', 'Rail Bhawan')
('110001', 'New Delhi', 'Delhi', 'Sansad Marg Hpo')
('110001', 'New Delhi', 'Delhi', 'Sansadiya Soudh')
('110001', 'New Delhi', 'Delhi', 'Secretariat North')
('110001', 'New Delhi', 'Delhi', 'Shastri Bhawan')
('110001', 'New Delhi', 'Delhi', 'Supreme Court')
('110002', 'New Delhi', 'Delhi', 'Rajghat Power House')
('110002', 'New Delhi', 'Delhi', 'Minto Road')
('110002', 'New Delhi', 'Delhi', 'Indraprastha Hpo')
('110002', 'New Delhi', 'Delhi', 'Darya Ganj')
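To see how such a pattern pulls the four columns out of a row without touching the network, here is a minimal sketch run on a made-up row. The `sample` string and its width values are assumptions for illustration only, not a copy of the real page, and the leading tabs and trailing `\r\n` of the real pattern are left out for brevity.

```python
import re

# Hypothetical row, shaped like the table rows the pattern above targets.
sample = ('<tr><td style="width:10%;">110001</td>'
          '<td style="width:30%;">New Delhi</td>'
          '<td style="width:30%;">Delhi</td>'
          '<td style="width:30%;">Janpath</td></tr>')

# One capture group per cell: digits for the pincode, lazy text for the rest.
row_pattern = re.compile(
    r'<td style="width:\d+%;">(\d+)</td>'
    r'<td style="width:\d+%;">(.+?)</td>'
    r'<td style="width:\d+%;">(.+?)</td>'
    r'<td style="width:\d+%;">(.+?)</td>'
)

# findall returns one tuple per matched row, one item per capture group.
rows = row_pattern.findall(sample)
```

Because every group is followed by a literal `</td>`, the lazy `(.+?)` stops at the end of its own cell instead of running on into the next one.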
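I wrote the regex-based code at the top of this answer after studying the structure of the HTML source with the following code (I think you will understand it without further explanation):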
from urllib2 import Request,urlopen
import re
pgno = 2
url = "http://www.eximguru.com/traderesources/pincode.aspx?&GridInfo=Pincode0%s" %str(pgno)
print url +'\n'
sock = urlopen(url)
htmlcode = sock.read()
sock.close()
li = htmlcode.splitlines(True)
print '\n'.join(str(i) + ' ' + repr(line)+'\n' for i,line in enumerate(li) if 275<i<300)
ch = ''.join(li[0:291])
from collections import defaultdict
didi =defaultdict(int)
for c in ch:
    didi[c] += 1
print '\n\n'+repr(li[289])
print '\n'.join('%r -> %s' % (c,didi[c]) for c in li[289] if didi[c]<35)
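Now, the problem is that the same HTML is returned for every value of pgno. The site may be detecting that a program, rather than a browser, is connecting to fetch the data. This would have to be handled with the tools in urllib2, but I am not experienced with them.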
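One thing that is commonly tried in this situation is sending a browser-like User-Agent header. The sketch below only builds the request object and does not send it; `build_request` is a hypothetical helper name, the header value is an arbitrary example, and nothing here confirms that the header is what this site checks. If paging is driven by the `__doPostBack` links seen in the HTML, a header alone may well change nothing.

```python
try:
    from urllib.request import Request  # Python 3
except ImportError:
    from urllib2 import Request  # Python 2, as used in this answer


def build_request(pgno):
    """Build (but do not send) a request for one page with a custom User-Agent."""
    # URL layout copied from the code above; the header value is only an example.
    url = "http://www.eximguru.com/traderesources/pincode.aspx?&GridInfo=Pincode0%s" % pgno
    return Request(url, headers={"User-Agent": "Mozilla/5.0"})


req = build_request(2)
```

The resulting `req` can be passed to `urlopen` in place of the plain URL string.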
Answer 1 (score: 1)
If your problem is the formatting of the number, use this instead of adding str to an int:
>>> pgno = 1
>>> while pgno < 20:
...     print '%02d' % pgno
...     pgno += 1
...
01
02
03
04
05
06
07
08
09
10
11
12
13
14
15
16
17
18
19
For more options, see the string format docs.
Also, in a more Pythonic way, using string format:
>>> for pgno in range(9, 12):
...     print '{0:02d}'.format(pgno)
...
09
10
11
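Applied to the URL from the question, the same formatting could look like the sketch below. It assumes the page parameter is a number zero-padded to two digits (so page 12 would be `Pincode12`, not `Pincode012`), which the question does not confirm. `BASE_URL` and `page_url` are names made up for this example.

```python
# Base of the URL, up to the part that varies per page (assumed layout).
BASE_URL = "http://www.eximguru.com/traderesources/pincode.aspx?&GridInfo=Pincode"


def page_url(pgno):
    """Return the URL for page `pgno`, zero-padded to two digits."""
    return BASE_URL + "{0:02d}".format(pgno)


# Build the URLs for the first three pages without fetching anything.
urls = [page_url(n) for n in range(1, 4)]
```

Keeping the formatting in one small function means the loop only deals with the integer counter, and the padding rule can be changed in a single place if the site turns out to expect something else.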
Answer 2 (score: 0)
The loop:
pgno = 1
while pgno < 4304:
    print pgno
    pgno += 1
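works correctly; the number is incrementing.
Either you are describing the problem incorrectly, or there is a problem with the basic assumption behind the question. Could you first try to describe what you want to do?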
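One thing worth checking separately from the loop is the string concatenation in the question's code. As a small sketch (the variable names `error` and `url_suffix` are only for illustration), joining a str and an int with `+` raises a TypeError, while an explicit `str()` conversion works:

```python
pgno = 1
error = None

try:
    # str + int is rejected by Python; no implicit conversion happens.
    url_suffix = "Pincode0" + pgno
except TypeError as exc:
    error = exc

# Converting the counter explicitly gives the intended suffix.
url_suffix = "Pincode0" + str(pgno)
```

If the `str()` version still returns the same page for every counter value, the URL is being built correctly and the cause is more likely on the server side than in the loop.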