Using Python

Date: 2017-01-01 17:33:26

Tags: python parsing beautifulsoup

I recently posted asking for help scraping data from the Yellow Pages, and @alecxe helped me by showing some new methods for extracting the data, but I'm stuck again. I'd like to scrape the link for each business on the Yellow Pages so I can fetch that business's Yellow Pages page, which contains more data. I want to add a variable called "url" and grab the href of the business — not the actual business website, but the business's Yellow Pages page. I've tried all sorts of things but nothing seems to work. The href is under "class=business-name".

import csv
import requests
from bs4 import BeautifulSoup


with open('cities_louisiana.csv','r') as cities:
    lines = cities.read().splitlines()

for city in lines:
    print(city)
url = "http://www.yellowpages.com/search?search_terms=businesses&geo_location_terms=baton%rouge+LA&page=" + str(count)

for city in lines:
    for x in range (0, 50):
        print("http://www.yellowpages.com/search?search_terms=businesses&geo_location_terms=baton%rouge+LA&page="+str(x))
        page = requests.get("http://www.yellowpages.com/search?search_terms=businesses&geo_location_terms=baton%rouge+LA&page="+str(x))
        soup = BeautifulSoup(page.text, "html.parser")
        for result in soup.select(".search-results .result"):
            try:
                name = result.select_one(".business-name").get_text(strip=True, separator=" ")
            except:
                pass
            try:
                streetAddress = result.select_one(".street-address").get_text(strip=True, separator=" ")
            except:
                pass
            try:
                city = result.select_one(".locality").get_text(strip=True, separator=" ")
                city = city.replace(",", "")
                state = "LA"
                zip = result.select_one('span[itemprop$="postalCode"]').get_text(strip=True, separator=" ")
            except:
                pass

            try:
                telephone = result.select_one(".phones").get_text(strip=True, separator=" ")
            except:
                telephone = "No Telephone"
            try:
                categories = result.select_one(".categories").get_text(strip=True, separator=" ")
            except:
                categories = "No Categories"
            completeData = name, streetAddress, city, state, zip, telephone, categories
            print(completeData)
            with open("yellowpages_businesses_louisiana.csv", "a", newline="") as write:
                wrt = csv.writer(write)
                wrt.writerow(completeData)
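For the specific question being asked — getting the href out of the `business-name` element — a BeautifulSoup tag can be indexed like a dictionary to read its attributes. A minimal, self-contained sketch (the HTML snippet below is invented for illustration, not taken from the real site):

```python
from bs4 import BeautifulSoup

# A tiny invented snippet mimicking one search result on the page.
html = '<div class="result"><a class="business-name" href="/biz/some-business">Some Business</a></div>'

soup = BeautifulSoup(html, "html.parser")
link_element = soup.select_one(".business-name")

# A Tag supports dictionary-style access to its attributes.
href = link_element["href"]
print(href)  # a relative link; it still needs to be joined with the base URL
```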

1 Answer:

Answer 0: (score: 2)

There are several things you should implement:

  • extract the business link from the href attribute of the business-name element — a BeautifulSoup element can be "treated" like a dictionary for this
  • make the link absolute with urljoin()
  • make a request to the business page while maintaining a web-scraping session
  • parse the business page with BeautifulSoup and extract the desired information
  • add a time delay to avoid hitting the site too often

Complete working example that prints the business name from the search results page and the business description from the business profile page:

from urllib.parse import urljoin

import requests
import time
from bs4 import BeautifulSoup

url = "http://www.yellowpages.com/search?search_terms=businesses&geo_location_terms=baton%rouge+LA&page=1"

with requests.Session() as session:
    session.headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 Safari/537.36'}

    page = session.get(url)
    soup = BeautifulSoup(page.text, "html.parser")

    for result in soup.select(".search-results .result"):
        business_name_element = result.select_one(".business-name")
        name = business_name_element.get_text(strip=True, separator=" ")
        link = urljoin(page.url, business_name_element["href"])

        # extract additional business information
        business_page = session.get(link)
        business_soup = BeautifulSoup(business_page.text, "html.parser")
        description = business_soup.select_one("dd.description").text

        print(name, description)

        time.sleep(1)  # time delay to not hit the site too often
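The urljoin() step matters because the href on the search results page is relative. A quick standalone check of its behavior (the URLs here are illustrative, not real listings):

```python
from urllib.parse import urljoin

base = "http://www.yellowpages.com/search?search_terms=businesses"
link = urljoin(base, "/biz/some-business")  # hypothetical relative href
print(link)  # http://www.yellowpages.com/biz/some-business
```

Note that a root-relative path like "/biz/..." replaces both the path and the query string of the base URL, which is exactly what is wanted here.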
