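python - BeautifulSoup and requests don't produce the expected results from .findAll()

Asked: 2016-12-18 16:50:27

Tags: python parsing beautifulsoup python-requests python-3.5

I've been writing some code to retrieve a list of items and their corresponding prices from the Steam Market (for the game Unturned). I'm using BeautifulSoup (bs4) and the requests library. Here is my code so far: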

import requests
from bs4 import BeautifulSoup

items = []
prices = []

for page_num in range(1, 10):
    website = 'http://steamcommunity.com/market/search?appid=304930#p' + str(page_num) + '_popular_desc'
    r = requests.get(website)
    doc = r.text.split('\n')
    soup = BeautifulSoup(''.join(doc), "html.parser")

    names = soup.findAll("span", {"class": "market_listing_item_name"})
    for item in range(len(names)):
        items.append(names[item].contents[0])

    costs = soup.findAll("span", {"class": "normal_price"})
    for cost in range(len(costs)):
        prices.append(costs[cost].contents[0])

Expected output:

Festive Gift Present :  $0.32 USD
Halloween Gift Present :  $0.26 USD
Carbon Fiber Mystery Box :  $0.47 USD
Festive Hat :  $1.67 USD
Nuclear Matamorez :  $0.39 USD
... and so on

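The problem with this code is that it only gets the names from the first page. If I manually type the URL with a different number in place of page_num, the page changes and so does the HTML document. However, the code doesn't seem to get results from the second page onward. Is requests fetching the correct URL each time but getting the same HTML document back?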

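1 answer:

Answer 0 (score: 1)

Pages 2, 3, and so on are requested via AJAX (or something similar), so their markup isn't in the source when the page first loads. To get around this, we can sniff out the AJAX URL and parse its response directly, which in this case is JSON-encoded, i.e.: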

import json
from bs4 import BeautifulSoup
from urllib.request import urlopen  # Python 3; this module was urllib2 in Python 2

output = ""
items = []
prices = []
for page_num in range(0, 100, 10):
    start = page_num  # offset of the first result to return
    count = 10        # results per request (a page size, not an end index)

    url = urlopen("http://steamcommunity.com/market/search/render/?query=&start={}&count={}&search_descriptions=0&sort_column=popular&sort_dir=desc&appid=304930".format(start, count))
    # decode explicitly: json.loads() only accepts str on Python 3.5
    jsonCode = json.loads(url.read().decode("utf-8"))
    output += jsonCode['results_html']

soup = BeautifulSoup(output, "html.parser")

names = soup.findAll("span", {"class": "market_listing_item_name"})
for name in names:
    items.append(name.contents[0])

costs = soup.findAll("span", {"class": "normal_price"})
for cost in costs:
    if "Starting at" not in cost.contents[0]:  # skip the "Starting at" label, keep only the price
        prices.append(cost.contents[0])

print(items)
['Festive Gift Present', 'Halloween Gift Present', 'Hypertech Timberwolf', 'Holiday Scarf', 'Chill Honeybadger', etc...]
print(prices)
['$0.34 USD', '$0.28 USD', '$1.77 USD', '$0.31 USD', '$0.65 USD', etc...]
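
The behaviour described in the question can be checked offline. Everything after `#` in a URL is the fragment, which HTTP clients keep to themselves rather than sending to the server, so the nine URLs built in the question's loop all resolve to the same request. A minimal sketch using only the standard library (the helper name `server_visible_url` is made up for illustration):

```python
from urllib.parse import urldefrag


def server_visible_url(url):
    """Return the URL without its '#...' fragment, i.e. what the server is asked for."""
    return urldefrag(url).url


# Same URL pattern as in the question, one per page number.
base = 'http://steamcommunity.com/market/search?appid=304930#p{}_popular_desc'
urls = [base.format(page_num) for page_num in range(1, 10)]

# Collect the distinct server-side URLs; the page number lives only in the fragment.
distinct = {server_visible_url(u) for u in urls}
print(len(urls), "URLs built,", len(distinct), "distinct request(s):", distinct)
```

In a browser the fragment is presumably read by the page's own JavaScript, which then fetches the matching results; that would explain why typing the URL by hand appears to work while requests always sees page one.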

PS: Steam will temporarily ban your IP after ~50 requests.
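
Given that limit, it may be worth spacing the requests out. The sketch below is one possible way to do so, not a tested recipe: the one-second delay is an arbitrary assumption rather than a documented Steam threshold, the page size of 10 is likewise assumed, and `render_url` and `fetch_pages` are hypothetical helpers. The fetch function is passed in so the pacing logic can be tried without touching the network.

```python
import time
from urllib.parse import urlencode

RENDER_ENDPOINT = "http://steamcommunity.com/market/search/render/"


def render_url(start, count=10, appid=304930):
    """Build the JSON search URL from the answer for one slice of results."""
    params = [
        ("query", ""),
        ("start", start),            # offset of the first result
        ("count", count),            # assumed page size
        ("search_descriptions", 0),
        ("sort_column", "popular"),
        ("sort_dir", "desc"),
        ("appid", appid),
    ]
    return RENDER_ENDPOINT + "?" + urlencode(params)


def fetch_pages(fetch, starts, delay=1.0, sleep=time.sleep):
    """Call fetch(url) once per offset, pausing `delay` seconds between calls."""
    results = []
    for i, start in enumerate(starts):
        if i > 0:
            sleep(delay)  # pause before every request except the first
        results.append(fetch(render_url(start)))
    return results
```

For example, `fetch_pages(lambda url: urlopen(url).read(), range(0, 100, 10))` would request the first hundred results in ten slices, roughly a second apart. Whether any particular delay actually avoids the ban is not something the answer states.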