Browse a website and iterate over search results to fetch specific data

Asked: 2014-12-24 18:27:32

Tags: python web-scraping web-crawler

I'm working on a project to scrape www.boattrader.com and push the Make, Price, and Phone Number for each of 800 boat listings to a CSV file.

I'm looking for guidance on the best way to scrape the link to each boat listing from the search results, and then parse each listing page for the Make, Price, and Phone Number.

Any guidance would be greatly appreciated!

Thanks again!

from bs4 import BeautifulSoup, SoupStrainer
import requests


def extract_from_search(search_results):
    # make this into a function
    r = requests.get(search_results)
    ad_page_html = r.text
    soup = BeautifulSoup(ad_page_html, 'html.parser')

    possible_links = soup.find_all('a', {'class': 'btn btn-orange'})

    for link in possible_links:
        if link.has_attr('href'): 
            boat_links = link.attrs['href']

    return boat_links

search_results = 'http://www.boattrader.com/search-results/NewOrUsed-any/Type-all/Zip-90007/Radius-2000/Sort-Length:DESC/Page-1,50'
boat_links = extract_from_search(search_results)
print boat_links
# Why does this only print one link? What would be the best way to iterate over
# the search results, so I can put those links into the boat_listing variable
# to grab the information I'm looking for?

def extract_from_listing(boat_listing):
    r = requests.get(boat_listing)
    ad_page_html = r.text
    soup = BeautifulSoup(ad_page_html, 'html.parser')

    table_heads = soup.find_all('th')

    for th in table_heads:
        if th.text == "Make":
            make = th.find_next_sibling("td").text

    price = soup.find('span', {'class': 'bd-price'})

    formatted_price = price.string.strip()

    # the page seems to store the phone number reversed, so flip the string
    # back, then swap the parentheses the reversal left facing the wrong way
    contact_info = soup.find('div', {'class': 'phone'})
    reversed_phone = contact_info.string[::-1]

    temp_phone = reversed_phone.replace(')', '}')
    temp_phone2 = temp_phone.replace('(', ')')
    correct_phone = temp_phone2.replace("}", "(")

    return make, formatted_price, correct_phone

boat_listing = 'http://www.boattrader.com/listing/2009-Briggs-BR9134-Sportfish-102290211'
make, price, phone = extract_from_listing(boat_listing)
print make
print price
print phone

1 Answer:

Answer 0 (score: 0)

You are only returning the last link; you need to append each link to a list:

def extract_from_search(search_results):
    # make this into a function
    r = requests.get(search_results)
    ad_page_html = r.text
    soup = BeautifulSoup(ad_page_html, 'html.parser')

    possible_links = soup.find_all('a', {'class': 'btn btn-orange'})
    boat_links = []  # create a list to append all links to
    for link in possible_links:
        if link.has_attr('href'):
            boat_links.append(link.attrs['href']) # append each link
    return boat_links
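
A quick sanity check of the fix, using the search URL from the question (the exact count will depend on the live page):

search_results = 'http://www.boattrader.com/search-results/NewOrUsed-any/Type-all/Zip-90007/Radius-2000/Sort-Length:DESC/Page-1,50'
boat_links = extract_from_search(search_results)
print len(boat_links)  # now prints the number of listing links found, not just one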

Or use a list comprehension:

def extract_from_search(search_results):
    # make this into a function
    r = requests.get(search_results)
    ad_page_html = r.content  # use .content (raw bytes) and let BeautifulSoup handle the decoding
    soup = BeautifulSoup(ad_page_html, 'html.parser')
    possible_links = soup.find_all('a', {'class': 'btn btn-orange'})
    return [link.attrs['href'] for link in possible_links if link.has_attr('href')]
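
To tie everything together and answer the iteration part of the question (a sketch, not part of the original answer): loop over the links that extract_from_search returns, feed each one into extract_from_listing, and write the rows out with the standard csv module. The Page-N,50 URL pattern and the 16-page count (16 x 50 = 800 listings) are assumptions based on the search URL in the question:

import csv

base_url = ('http://www.boattrader.com/search-results/NewOrUsed-any/'
            'Type-all/Zip-90007/Radius-2000/Sort-Length:DESC/Page-%d,50')

with open('boats.csv', 'wb') as f:  # 'wb' mode for the csv module on Python 2
    writer = csv.writer(f)
    writer.writerow(['Make', 'Price', 'Phone'])
    for page in range(1, 17):  # assumed: 16 pages of 50 results = 800 listings
        for link in extract_from_search(base_url % page):
            make, price, phone = extract_from_listing(link)
            writer.writerow([make, price, phone])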