How to programmatically create looping URLs for scraping

Date: 2014-12-10 23:20:26

Tags: python python-2.7 web-scraping beautifulsoup web-crawler

I am working with the code below; however, I am lost as to how to make the Python code scrape in a loop and save everything, so that I can then write it all to a .csv file. Any help would be greatly appreciated :)

import requests
from bs4 import BeautifulSoup


url = "http://www.yellowpages.com/search?search_terms=bodyshop&geo_location_terms=Fort+Lauderdale%2C+FL"

r = requests.get(url)
soup = BeautifulSoup(r.content)

links = soup.find_all("a")

for link in links:
    print "<a href='%s'>%s</a>" %(link.get("href"), link.text)

g_data = soup.find_all("div", {"class": "info"})

for item in g_data:
    print item.contents[0].find_all("a", {"class": "business-name"})[0].text
    try:
        print item.contents[1].find_all("span", {"itemprop": "streetAddress"})[0].text
    except:
        pass
    try:
        print item.contents[1].find_all("span", {"itemprop": "addressLocality"})[0].text.replace(',', '')
    except:
        pass
    try:
        print item.contents[1].find_all("span", {"itemprop": "addressRegion"})[0].text
    except:
        pass
    try:
        print item.contents[1].find_all("span", {"itemprop": "postalCode"})[0].text
    except:
        pass
    try:
        print item.contents[1].find_all("li", {"class": "primary"})[0].text
    except:
        pass

I know that with this code:

url_page2 = url + '&page=' + str(2) + '&s=relevance'

I can get to the second page, but how do I loop through all of the site's result pages and make the results available in a .csv file?

1 Answer:

Answer 0 (score: 0)

Loop indefinitely, incrementing the page number starting from 1, and exit when there are no more results. Define the list of fields to extract and rely on the itemprop attribute to get each field value. Collect the items in a list of dictionaries, which can later be written to a csv file:

from pprint import pprint
import requests

from bs4 import BeautifulSoup


url = "http://www.yellowpages.com/search?search_terms=bodyshop&geo_location_terms=Fort%20Lauderdale%2C%20FL&page={page}&s=relevance"
fields = ["name", "streetAddress", "addressLocality", "addressRegion", "postalCode", "telephone"]

data = []
index = 1
while True:
    page_url = url.format(page=index)
    index += 1

    response = requests.get(page_url)
    soup = BeautifulSoup(response.content, "html.parser")

    page_results = soup.select('div.result')
    # exiting the loop if no results
    if not page_results:
        break

    for item in page_results:
        result = dict.fromkeys(fields)
        for field in fields:
            try:
                result[field] = item.find(itemprop=field).get_text(strip=True)
            except AttributeError:
                pass
        data.append(result)

    break  # DELETE ME: remove this line to crawl every results page

pprint(data)

For the first page, it prints:

[{'addressLocality': u'Fort Lauderdale,',
  'addressRegion': u'FL',
  'name': u"Abernathy's Paint And Body Shop",
  'postalCode': u'33315',
  'streetAddress': u'1927 SW 1st Ave',
  'telephone': u'(954) 522-8923'},

  ...

 {'addressLocality': u'Fort Lauderdale,',
  'addressRegion': u'FL',
  'name': u'Mega Auto Body Shop',
  'postalCode': u'33304',
  'streetAddress': u'828 NE 4th Ave',
  'telephone': u'(954) 523-9331'}]