Web scraping Craigslist apartment prices in Python does not show the highest-priced apartment

Time: 2016-04-17 16:43:18

Tags: python web-crawler

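It reports that the highest apartment price is $4700, while I can see listings priced at more than a million. Why isn't that showing up? What am I doing wrong?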

import requests
import re


r = requests.get("http://orlando.craigslist.org/search/apa")
r.raise_for_status()

html = r.text


matches = re.findall(r'<span class="price">\$(\d+)</span>', html)
prices = map(int, matches)


print "Highest price: ${}".format(max(prices))
print "Lowest price: ${}".format(min(prices))
print "Average price: ${}".format(sum(prices)/len(prices))

1 answer:

Answer 0 (score: 1)

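An HTML parser such as bs4 is very easy to use, and you can sort by price by adding ?sort=pricedsc to the URL, so the first match will be the maximum and the last match will be the minimum (for that page):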

r = requests.get("http://orlando.craigslist.org/search/apa?sort=pricedsc")
from bs4 import BeautifulSoup

html = r.content

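soup = BeautifulSoup(html)
# collect every price on the page; the list was used below without being defined
prices = [int(pr.text.strip("$")) for pr in soup.select("span.price")]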
print "Highest price: ${}".format(prices[0])
print "Lowest price: ${}".format(prices[-1])
print "Average price: ${}".format(sum(prices, 0.0)/len(prices))
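The sort order is controlled purely by the query string, so the URL can be built rather than typed by hand. The following is a minimal sketch, written for Python 3 and using only the standard library. The helper name `build_search_url` is hypothetical; the base URL and the `pricedsc` / `priceasc` values come from this answer.

```python
from urllib.parse import urlencode

# Search URL taken from the question.
BASE_URL = "http://orlando.craigslist.org/search/apa"


def build_search_url(sort=None, base_url=BASE_URL):
    """Return the search URL, optionally with a ?sort=... query string.

    Hypothetical helper for illustration only.
    """
    if sort is None:
        return base_url
    return "{}?{}".format(base_url, urlencode({"sort": sort}))


descending_url = build_search_url("pricedsc")  # most expensive first
ascending_url = build_search_url("priceasc")   # cheapest first
print(descending_url)
print(ascending_url)
```

Either URL can then be passed to `requests.get` exactly as in the snippets in this answer.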

If you want the lowest price, you need to sort in ascending order:

r = requests.get("http://orlando.craigslist.org/search/apa?sort=priceasc")
from bs4 import BeautifulSoup

html = r.content

soup = BeautifulSoup(html)
prices = [int(pr.text.strip("$")) for pr in soup.select("span.price")]
print "Highest price: ${}".format(prices[-1])
print "Lowest price: ${}".format(prices[0])
print "Average price: ${}".format(sum(prices, 0.0)/len(prices))

Now the output is very different:

Highest price: $70
Lowest price: $1
Average price: $34.89
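The three statistics can also be computed without relying on the sort order of the page. This is a small sketch with a hypothetical helper, `summarize`, and made-up sample values. It also shows why the snippets here use `sum(prices, 0.0)`: on Python 2, dividing one integer by another truncates, so the average needs a float somewhere in the division.

```python
def summarize(prices):
    """Return (highest, lowest, average) for a non-empty list of prices.

    Hypothetical helper for illustration only.
    """
    if not prices:
        raise ValueError("no prices were scraped")
    # float() forces true division; on Python 2, int / int would truncate.
    average = float(sum(prices)) / len(prices)
    return max(prices), min(prices), average


# Made-up sample values, not real listings.
sample = [70, 1, 35, 34]
highest, lowest, average = summarize(sample)
print("Highest price: ${}".format(highest))
print("Lowest price: ${}".format(lowest))
print("Average price: ${:.2f}".format(average))
```

Keep in mind that these numbers still describe only the prices that were scraped, which by default is a single page of results.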

If you want the average over all listings, you need to add more logic. By default you only see 100 of 2500 results, but we can change that.

r = requests.get("http://orlando.craigslist.org/search/apa")
from bs4 import BeautifulSoup

html = r.content

soup = BeautifulSoup(html)
prices = [int(pr.text.strip("$")) for pr in soup.select("span.price")]

# link to next 100 results
nxt = soup.select_one("a.button.next")["href"]

# keep looping until we find a page with no next button
while nxt:
    url = "http://orlando.craigslist.org{}".format(nxt)
    r = requests.get(url)
    soup = BeautifulSoup(r.content)
    # extend prices to our list
    prices.extend([int(pr.text.strip("$")) for pr in soup.select("span.price")])
    nxt = soup.select_one("a.button.next")
    if nxt:
        nxt = nxt["href"]

That will give you every listing from 1-2500.
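The same "follow the next button" loop can be sketched so that it runs offline, with the download step passed in as a function. This version is written for Python 3 and swaps BeautifulSoup for the standard library's `html.parser` so it needs no third-party packages. `ListingParser`, `collect_prices`, `max_pages` and the `example.test` pages are all hypothetical names and data for illustration; only the `span.price` and `a.button.next` selectors come from this answer.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ListingParser(HTMLParser):
    """Collect span.price values and the href of the first a.button.next link."""

    def __init__(self):
        super().__init__()
        self.prices = []
        self.next_href = None
        self._in_price = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "price" in classes:
            self._in_price = True
        elif tag == "a" and "button" in classes and "next" in classes:
            # Keep the first non-empty href; an empty href means no next page.
            if self.next_href is None:
                self.next_href = attrs.get("href") or None

    def handle_data(self, data):
        if self._in_price:
            digits = data.strip().lstrip("$")
            # Skip anything that is not a plain whole number.
            if digits.isdigit():
                self.prices.append(int(digits))

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_price = False


def collect_prices(fetch, start_url, max_pages=50):
    """Follow "next" links from start_url, gathering prices from every page.

    fetch is any callable that takes a URL and returns the page HTML as text.
    """
    prices, url, seen = [], start_url, set()
    # Stop on a missing next link, a repeated URL, or the page limit.
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        parser = ListingParser()
        parser.feed(fetch(url))
        prices.extend(parser.prices)
        url = urljoin(url, parser.next_href) if parser.next_href else None
    return prices


# Two made-up pages standing in for real responses.
FAKE_PAGES = {
    "http://example.test/search/apa":
        '<span class="price">$900</span>'
        '<a class="button next" href="/search/apa?s=100">next</a>',
    "http://example.test/search/apa?s=100":
        '<span class="price">$1200</span>'
        '<a class="button next" href="">next</a>',
}
all_prices = collect_prices(FAKE_PAGES.__getitem__, "http://example.test/search/apa")
print(all_prices)
```

Against the live site, `fetch` could be something like `lambda url: requests.get(url).text`. The page limit and the set of visited URLs are there so the loop cannot run forever if the markup changes.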