Can only parse the first page of a website

Asked: 2014-06-17 04:21:19

Tags: python parsing

from urllib.request import urlopen
from html.parser import HTMLParser

# Positions (1-based, in handle_data call order) of the text nodes
# that hold the prices and item names on the search page.
datalist = [223, 236, 250, 263, 277, 290, 304, 317, 331, 344, 358, 371, 385, 398, 412, 425, 439, 452, 466, 479]
x = 0          # running count of text nodes seen so far
dictlist = []  # collected prices and item names

html = urlopen("http://steamcommunity.com/market/search?q=&category_730_ItemSet%5B%5D=any&category_730_Weapon%5B%5D=any&appid=730#p5_quantity_desc").read().decode('utf-8')

class MyHTMLParser(HTMLParser):
    def handle_data(self, data):
        # x is rebound here, so it needs `global`; dictlist is only mutated.
        global x
        x += 1
        if x in datalist:
            dictlist.append(data)

MyHTMLParser().feed(html)
print(dictlist)
input()  # wait for Enter before exiting

Output:

['0.03', 'Sticker Capsule',  
 '0.03', 'Sticker Capsule 2', 
 '0.04', 'eSports Winter Case',  
 '0.04', 'CS:GO Weapon Case 3', 
 '0.11', 'Winter Offensive Weapon Case', 
 '0.04', 'Community Sticker Capsule 1', 
 '0.10', 'CS:GO Weapon Case 2', 
 '0.39', 'Operation Phoenix Weapon Case', 
 '1.10', 'Huntsman Weapon Case', 
 '0.72', 'eSports Case']

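Whichever page I try to parse, I only ever get the first page of the Steam market: http://steamcommunity.com/market/search?q=&category_730_ItemSet%5B%5D=any&category_730_Weapon%5B%5D=any&appid=730#p5_quantity_desc

The URL clearly contains "#p5" (page 5), but the top item on page 5 is the 5-7, while my script keeps printing Sticker Capsule (the top item on the first page).

I suspect this may be specific to Steam, because when you change pages on the site it only reloads the box containing the data rather than the whole web page, but I want to make sure there isn't something silly in my code.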
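
One way to check this suspicion without touching the network is to look at which part of the URL an HTTP client actually requests. Everything after "#" is a fragment, which is handled on the client side and is not sent to the server, so a request for "#p5_quantity_desc" is the same request as one for the first page. The sketch below uses only the standard library; the helper name split_request_and_fragment is made up for illustration and is not part of the code above.

```python
from urllib.parse import urlsplit, urlunsplit

URL = ("http://steamcommunity.com/market/search?q=&category_730_ItemSet%5B%5D=any"
       "&category_730_Weapon%5B%5D=any&appid=730#p5_quantity_desc")

def split_request_and_fragment(url):
    """Return (the URL an HTTP client requests, the client-side fragment).

    Hypothetical helper for illustration only.
    """
    parts = urlsplit(url)
    # Rebuild the URL with an empty fragment: this is all the server sees.
    requested = urlunsplit((parts.scheme, parts.netloc, parts.path, parts.query, ""))
    return requested, parts.fragment

requested, fragment = split_request_and_fragment(URL)
print("requested:", requested)
print("fragment :", fragment)
```

If the page number lives only in the fragment, as it does here, the server cannot know which page was meant, which would be consistent with always getting the first page back. The later pages would then have to come from whatever request the page's own script makes when the box reloads.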

0 answers:

No answers