How to extract the same nested data from a list of URLs with BeautifulSoup

Asked: 2019-02-19 21:25:16

Tags: python pandas web-scraping beautifulsoup screen-scraping

Good afternoon,

I'm fairly new to scraping and am currently working through this project. The data I want to extract is each company's name, address, phone number, and company URL, all pulled from nested web pages.

Main page: http://www.therentalshow.com/find-exhibitors/sb-search/equipment/sb-inst/8678/sb-logid/242109-dcja1tszmylg308y/sb-page/1
Nested page: http://www.therentalshow.com/exhibitor-detail/cid/45794/exhib/2019

I was able to compile this list of URLs, but I'm having a hard time scraping each company's information and writing it out in tabular form as a CSV.

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
import requests
import pandas as pd
import csv, os

my_url = 'http://www.therentalshow.com/find-exhibitors/sb-search/equipment/sb-inst/8678/sb-logid/242109-dcja1tszmylg308y/sb-page/1'
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, 'lxml')

#create list of urls from main page
urls = []
tags = page_soup.find_all('a',{'class':'avtsb_title'})
for tag in tags:
    urls.append('http://www.therentalshow.com' + tag.get('href'))

#iterate through each page to return company data
for url in urls:
    site = uReq(url)
    soups = soup(site, 'lxml')

    name = page_soup.select('h2')
    address = page_soup.find('span',{'id':'dnn_ctr8700_TRSExhibitorDetail_lblAddress'})
    city = page_soup.find('span',{'id':'dnn_ctr8700_TRSExhibitorDetail_lblCityStateZip'})
    phone = page_soup.find('span',{'id':'dnn_ctr8700_TRSExhibitorDetail_lblPhone'})
    website = page_soup.find('a',{'id':'dnn_ctr8700_TRSExhibitorDetail_hlURL'})

    os.getcwd()
    outputFile = open('output2.csv', 'a', newline='')
    outputWriter = csv.writer(outputFile)
    outputWriter.writerow([name, address, city, phone, website])

The output I get back is

[],,,,
[],,,,
for a total of 99 rows. My total number of links is 100.

I'd like the variable names above to be the headers of the CSV file, but my current output is not what I want. I'm pretty lost, so any help would be greatly appreciated. Thanks!

1 answer:

Answer 0 (score: 1)

I can't fully test at the moment because requests is hanging on me, but you need to extract the .text of the returned elements. Also, your first selection returns a list, so change it to select_one or index into the list appropriately. I prefer using CSS selectors over find.

I pulled the HTML of one page into an html variable:

from bs4 import BeautifulSoup as bs

# html holds the source of one exhibitor detail page
page_soup = bs(html, 'lxml')
name = page_soup.select_one('h2').text
address = page_soup.select_one('#dnn_ctr8700_TRSExhibitorDetail_lblAddress').text
city = page_soup.select_one('#dnn_ctr8700_TRSExhibitorDetail_lblCityStateZip').text
phone = page_soup.select_one('#dnn_ctr8700_TRSExhibitorDetail_lblPhone').text
website = page_soup.select_one('#dnn_ctr8700_TRSExhibitorDetail_hlURL').text
print([name, address, city, phone, website])

Copying the HTML from the first two links and applying the changes above produces:

['A-1 Scaffold Manufacturing', '590 Commerce Pkwy', 'Hays, KS', '785-621-5121', 'www.a1scaffoldmfg.com']
['Accella Tire Fill Systems', '2003 Curtain Pole Rd', 'Chattanooga, TN', '423-697-0400', 'www.accellatirefill.com']
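
For completeness, here is a minimal sketch of how the select_one/.text extraction above could be folded back into the asker's loop, with the field names written once as a header row. It assumes the page structure shown in the question; the use of urllib, the 'avtsb_title' link class, and the output file name output2.csv are carried over from the question rather than from the answer above:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import csv

base = 'http://www.therentalshow.com'
start = base + '/find-exhibitors/sb-search/equipment/sb-inst/8678/sb-logid/242109-dcja1tszmylg308y/sb-page/1'

# collect the detail-page URLs from the main page, as in the question
main_soup = BeautifulSoup(urlopen(start).read(), 'lxml')
urls = [base + a.get('href') for a in main_soup.find_all('a', {'class': 'avtsb_title'})]

selectors = ['h2',
             '#dnn_ctr8700_TRSExhibitorDetail_lblAddress',
             '#dnn_ctr8700_TRSExhibitorDetail_lblCityStateZip',
             '#dnn_ctr8700_TRSExhibitorDetail_lblPhone',
             '#dnn_ctr8700_TRSExhibitorDetail_hlURL']

with open('output2.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['name', 'address', 'city', 'phone', 'website'])  # header row
    for url in urls:
        detail = BeautifulSoup(urlopen(url).read(), 'lxml')
        row = []
        for selector in selectors:
            el = detail.select_one(selector)
            row.append(el.text.strip() if el else '')  # guard against missing fields
        writer.writerow(row)

Opening the file once in 'w' mode (instead of reopening it in append mode on every iteration) writes the header a single time and ensures the file is properly closed when the loop finishes.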