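I've been writing a piece of code to retrieve a list of items and their corresponding prices from the Steam market (for the game Unturned). I'm using BeautifulSoup (bs4) and the requests library. This is my code so far: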
import requests
from bs4 import BeautifulSoup

items = []
prices = []
for page_num in range(1,10):
    website = 'http://steamcommunity.com/market/search?appid=304930#p'+str(page_num)+'_popular_desc'
    r = requests.get(website)
    doc = r.text.split('\n')
    soup = BeautifulSoup(''.join(doc), "html.parser")
    names = soup.findAll("span", { "class" : "market_listing_item_name" })
    for item in range(len(names)):
        items.append(names[item].contents[0])
    costs = soup.findAll("span", { "class" : "normal_price" })
    for cost in range(len(costs)):
        prices.append(costs[cost].contents[0])
Expected output:
Festive Gift Present : $0.32 USD
Halloween Gift Present : $0.26 USD
Carbon Fiber Mystery Box : $0.47 USD
Festive Hat : $1.67 USD
Nuclear Matamorez : $0.39 USD
... and so on
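The problem with this code is that it only gets the names from the first page. If I type the URL by hand with a different number in place of page_num, the page changes and so does the HTML document. The code, however, doesn't seem to get results from the second page onward. requests fetches the correct URL each time, so why does the HTML come back the same?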
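One thing worth checking here is what actually goes over the wire. Everything after `#` in a URL is a fragment, which HTTP clients keep locally and do not send to the server, so the page URLs built in the loop would all produce the same request. A minimal sketch using only the Python 3 standard library; `request_target` is a hypothetical helper written for this illustration, not part of requests:

```python
from urllib.parse import urlsplit

BASE = "http://steamcommunity.com/market/search?appid=304930"


def request_target(url):
    """Return the path and query a client would put in the HTTP request line."""
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return target  # the fragment (after '#') is deliberately left out


# Same URLs as in the loop, for pages 1 to 3.
urls = [BASE + "#p" + str(n) + "_popular_desc" for n in range(1, 4)]
for url in urls:
    print(urlsplit(url).fragment, "->", request_target(url))
```

Every line printed shows a different fragment but the same request target, which is consistent with getting the first page back each time.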
Answer 0 (score: 1)
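Pages 2, 3, etc. are requested via ajax (or similar), so their content is not in the source when the page first loads. To get around this, we can sniff the ajax URL and parse its response directly, which in this case is JSON-encoded, i.e.: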
# Python 2 (urllib2 and the print statement)
import json
from bs4 import BeautifulSoup
from urllib2 import urlopen

output = ""
items = []
prices = []
for page_num in range(0, 100, 10):  # offsets 0, 10, ..., 90
    start = page_num
    count = 10  # page size, not an end offset
    url = urlopen("http://steamcommunity.com/market/search/render/?query=&start={}&count={}&search_descriptions=0&sort_column=popular&sort_dir=desc&appid=304930".format(start, count))
    jsonCode = json.loads(url.read())
    output += jsonCode['results_html']  # HTML for this page of results

# Parse the accumulated HTML once, after all pages have been fetched
soup = BeautifulSoup(output, "html.parser")
names = soup.findAll("span", { "class" : "market_listing_item_name" })
for item in range(len(names)):
    items.append(names[item].contents[0])
costs = soup.findAll("span", { "class" : "normal_price" })
for cost in range(len(costs)):
    # skip spans that begin with the "Starting at" label; we just want the price
    if "Starting at" not in costs[cost].contents[0]:
        prices.append(costs[cost].contents[0])
print items
[u'Festive Gift Present', u'Halloween Gift Present', u'Hypertech Timberwolf', u'Holiday Scarf', u'Chill Honeybadger', etc...]
print prices
[u'$0.34 USD', u'$0.28 USD', u'$1.77 USD', u'$0.31 USD', u'$0.65 USD', etc...]
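To get from the two lists to the "name : price" lines the question asks for, the names and prices can be zipped together. The following is a minimal Python 3 sketch of that idea. `pair_listings` is a hypothetical helper, and `SAMPLE_HTML` is a hand-written stand-in for `results_html` whose markup is an assumption based on the class names used above, not a captured Steam response:

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for jsonCode['results_html']; the real markup may differ.
SAMPLE_HTML = (
    '<span class="market_table_value normal_price">Starting at:<br/>'
    '<span class="normal_price">$0.34 USD</span></span>'
    '<span class="market_listing_item_name">Festive Gift Present</span>'
    '<span class="market_table_value normal_price">Starting at:<br/>'
    '<span class="normal_price">$0.28 USD</span></span>'
    '<span class="market_listing_item_name">Halloween Gift Present</span>'
)


def pair_listings(results_html):
    """Return (name, price) tuples in the order they appear in the HTML."""
    soup = BeautifulSoup(results_html, "html.parser")
    names = [
        span.get_text(strip=True)
        for span in soup.find_all("span", class_="market_listing_item_name")
    ]
    prices = []
    for span in soup.find_all("span", class_="normal_price"):
        text = span.get_text(strip=True)
        # Same filter as above: drop spans carrying the "Starting at" label.
        if "Starting at" not in text:
            prices.append(text)
    return list(zip(names, prices))


for name, price in pair_listings(SAMPLE_HTML):
    print(name, ":", price)
```

Zipping assumes each listing yields exactly one name and one price in the same order; if a listing could lack a price, the pairs would drift out of step.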
PS: Steam will temporarily ban your IP after ~50 requests
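Given that limit, it may help to pause between requests. Below is a rough Python 3 sketch using only the standard library. The helper names (`render_url`, `fetch_json`, `fetch_pages`) and the 3 second delay are assumptions chosen for illustration; the delay is not a documented Steam threshold:

```python
import json
import time
from urllib.parse import urlencode
from urllib.request import urlopen

RENDER_URL = "http://steamcommunity.com/market/search/render/"


def render_url(start, count=10):
    """Build the ajax URL used above for one page of results."""
    params = {
        "query": "",
        "start": start,
        "count": count,
        "search_descriptions": 0,
        "sort_column": "popular",
        "sort_dir": "desc",
        "appid": 304930,
    }
    return RENDER_URL + "?" + urlencode(params)


def fetch_json(url):
    """Download a URL and decode its JSON body."""
    with urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))


def fetch_pages(total, count=10, delay=3.0, fetch=fetch_json, sleep=time.sleep):
    """Fetch `total` results in pages of `count`, pausing `delay` seconds between requests."""
    html_parts = []
    for index, start in enumerate(range(0, total, count)):
        if index:  # no pause before the first request
            sleep(delay)
        data = fetch(render_url(start, count))
        html_parts.append(data["results_html"])
    return "".join(html_parts)
```

Calling `fetch_pages(100)` would make ten requests about three seconds apart and return the combined HTML, ready to hand to BeautifulSoup. The `fetch` and `sleep` parameters exist so the loop can be exercised without touching the network.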