Multiple "page_soup" findAll calls in one Beautiful Soup script

Asked: 2018-10-11 16:42:50

Tags: python python-3.x beautifulsoup

I am using Python for data scraping and have built a script to scrape a website.

Is it possible to have two 'page_soup' findAll calls, each with its own for loop, in a single Beautiful Soup script, or does the whole page have to use just one? i.e.

  1. containers = page_soup.findAll("div",{"class":"ppr_priv_location_detail_header"})

  2. details_containers = page_soup.findAll("div",{"class":"content_block"})

How do I add the for loop?

The content I want to get is:

content = details_container.findAll("div",{"class":"content"})
price_range = content.span.text.replace('\n', ' ')

Here is the code I am using:

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

my_url = 'https://www.tripadvisor.co.uk/Restaurant_Review-g186338-d12801049-Reviews-Core_by_Clare_Smyth-London_England.html'

uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()

#HTML PARSER
page_soup = soup(page_html, "html.parser")

filename ="trip2.0.csv"
f = open(filename, 'w')

headers ="title, street_address, price_range\n "

containers = page_soup.findAll("div",{"class":"ppr_priv_location_detail_header"})

f.write(headers)

for container in containers:

    title = container.h1.text

    street_address_container = container.findAll("span",{"class":"street-address"})
    street_address = street_address_container[0].text

    # details_container is never defined above -- this is the second findAll
    # (and its loop) that I don't know how to add
    content = details_container.findAll("div",{"class":"content"})
    price_range = content.span.text.replace('\n', ' ')

    print("title: " + title)
    print("street_address: " + street_address)
    print("price_range: " + price_range)


    f.write(title + "," + street_address + "," + price_range + "\n")

f.close()
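
In other words, the missing piece is a second findAll plus a loop that walks both result lists together. Below is a rough, untested sketch of that shape: it assumes each header container has a matching "content_block" at the same position (so zip() pairs them correctly), and it uses find rather than findAll for the inner "content" div, since content.span only works on a single tag, not on a findAll result list:

containers = page_soup.findAll("div",{"class":"ppr_priv_location_detail_header"})
details_containers = page_soup.findAll("div",{"class":"content_block"})

# zip() pairs the two lists by position -- an assumption about the page layout
for container, details_container in zip(containers, details_containers):
    title = container.h1.text
    street_address = container.findAll("span",{"class":"street-address"})[0].text

    content = details_container.find("div",{"class":"content"})  # single tag, not a list
    price_range = content.span.text.replace('\n', ' ')

    f.write(title + "," + street_address + "," + price_range + "\n")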

1 Answer:

Answer 0 (score: 0)

You could do something like the following...

import requests, pandas
import collections, datetime, os
from bs4 import BeautifulSoup

now = datetime.datetime.now()

def Save_to_Csv(data):
    # Append today's row to trip.csv, writing the header only on the first run
    filename = 'trip.csv'
    df = pandas.DataFrame(data)
    df.set_index('Date', drop=True, inplace=True)
    if os.path.isfile(filename):
        with open(filename, 'a') as f:
            df.to_csv(f, mode='a', sep=",", header=False, encoding='utf-8')
    else:
        df.to_csv(filename, sep=",", encoding='utf-8')

url = ('https://www.tripadvisor.co.uk'
       '/Restaurant_Review-g186338-d12801049'
       '-Reviews-Core_by_Clare_Smyth-London_England.html')

# Fetch and parse the page
req = requests.get(url)
soup = BeautifulSoup(req.content, 'html.parser')

# Grab the first match for each CSS class
title = soup.select('.heading_title')[0].text
street_address = soup.select('.street-address')[0].text
print('Title:', title, '\n', 'Street_address:', street_address)

# Collect the scraped fields, keyed by today's date, then save them
foundings = collections.OrderedDict()
foundings['Date'] = [now.strftime("%Y-%m-%d")]
foundings['Title'] = title
foundings['Street_Address'] = street_address

Save_to_Csv(foundings)

Output:

Title: Core by Clare Smyth
Street_address: 92 Kensington Park Road
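
If the price range from the question is wanted as well, the same select approach could be extended with a couple of lines before the Save_to_Csv(foundings) call. This is a sketch only: it reuses the "content_block" / "content" class names from the question's findAll calls, which is an assumption about the current page markup, so it guards against the element being missing:

# Assumed selector, built from the question's class names -- may need updating
price_tag = soup.select_one('.content_block .content span')
price_range = price_tag.text.replace('\n', ' ') if price_tag else ''
foundings['Price_Range'] = price_range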