x = 0
htmlpar = ""
dictlist = []
datalist = [223, 236, 250, 263, 277, 290, 304, 317, 331, 344, 358, 371, 385, 398, 412, 425, 439, 452, 466, 479]
from urllib.request import urlopen
from html.parser import HTMLParser
html = urlopen("http://steamcommunity.com/market/search?q=&category_730_ItemSet%5B%5D=any&category_730_Weapon%5B%5D=any&appid=730#p5_quantity_desc").read().decode('utf-8')
class MyHTMLParser(HTMLParser):
    def handle_data(self, data):
        global x
        global datalist
        global dictlist
        x += 1
        if x in datalist:
            dictlist.append(data)
MyHTMLParser().feed(html)
print(dictlist)
input()
Output:
['0.03', 'Sticker Capsule',
'0.03', 'Sticker Capsule 2',
'0.04', 'eSports Winter Case',
'0.04', 'CS:GO Weapon Case 3',
'0.11', 'Winter Offensive Weapon Case',
'0.04', 'Community Sticker Capsule 1',
'0.10', 'CS:GO Weapon Case 2',
'0.39', 'Operation Phoenix Weapon Case',
'1.10', 'Huntsman Weapon Case',
'0.72', 'eSports Case']
Whatever page I try to parse, it only ever handles the first page of the Steam market: http://steamcommunity.com/market/search?q=&category_730_ItemSet%5B%5D=any&category_730_Weapon%5B%5D=any&appid=730#p5_quantity_desc
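The URL clearly contains "#p5" (page 5), and the top of page 5 is Five-SeveNs, yet the script keeps printing the sticker capsules (the first items at the top of page 1).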
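I suspect this is specific to Steam, because when you change pages on the site it only reloads the box containing the data rather than the whole web page, but I want to make sure there isn't something stupid in my code.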
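One way to check the suspicion above: the part of a URL after "#" is a fragment, which the browser keeps for client-side use and which is not part of the HTTP request. A plain urlopen call would therefore be expected to fetch the same first page whether or not "#p5" is present. Below is a minimal sketch using only the standard library's urllib.parse; the names URL, fragment, and url_without_fragment are illustrative and not part of the original script.

```python
from urllib.parse import urlsplit, urldefrag

# The same URL as in the script above, split across two lines for readability.
URL = ("http://steamcommunity.com/market/search?q=&category_730_ItemSet%5B%5D=any"
       "&category_730_Weapon%5B%5D=any&appid=730#p5_quantity_desc")

# Everything after '#' is the fragment; it stays on the client side.
fragment = urlsplit(URL).fragment

# This is the URL with the fragment removed, i.e. what identifies the
# resource that is actually requested from the server.
url_without_fragment = urldefrag(URL).url

print("fragment:", fragment)
print("requested:", url_without_fragment)
```

If the fragment really is what selects the page in the browser, then the page 5 data would come from a separate request made by the site's JavaScript. Which request that is would have to be checked in the browser's network tools; it is not something this sketch can show.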