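Good afternoon,
I am fairly new to scraping and am currently stuck on this project. The data I want to extract is the company name, address, phone number, and company URL (all taken from nested pages).
Main page = http://www.therentalshow.com/find-exhibitors/sb-search/equipment/sb-inst/8678/sb-logid/242109-dcja1tszmylg308y/sb-page/1
Nested page = http://www.therentalshow.com/exhibitor-detail/cid/45794/exhib/2019
I was able to build the list of nested URLs, but I am struggling to scrape each company's information and write it to a CSV in tabular form.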
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
import requests
import pandas as pd
import csv, os

my_url = 'http://www.therentalshow.com/find-exhibitors/sb-search/equipment/sb-inst/8678/sb-logid/242109-dcja1tszmylg308y/sb-page/1'
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, 'lxml')

#create list of urls from main page
urls = []
tags = page_soup.find_all('a',{'class':'avtsb_title'})
for tag in tags:
    urls.append('http://www.therentalshow.com' + tag.get('href'))

#iterate through each page to return company data
for url in urls:
    site = uReq(url)
    soups = soup(site, 'lxml')
    name = page_soup.select('h2')
    address = page_soup.find('span',{'id':'dnn_ctr8700_TRSExhibitorDetail_lblAddress'})
    city = page_soup.find('span',{'id':'dnn_ctr8700_TRSExhibitorDetail_lblCityStateZip'})
    phone = page_soup.find('span',{'id':'dnn_ctr8700_TRSExhibitorDetail_lblPhone'})
    website = page_soup.find('a',{'id':'dnn_ctr8700_TRSExhibitorDetail_hlURL'})
    os.getcwd()
    outputFile = open('output2.csv', 'a', newline='')
    outputWriter = csv.writer(outputFile)
    outputWriter.writerow([name, address, city, phone, website])
The output I get back is
[],,,,
[],,,,
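for 99 rows in total, although I have 100 links.
I would like the variable names above to be the header row of the CSV file, but my current output is not what I want. I am quite lost, so any help would be greatly appreciated. Thank you!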
Answer 0 (score: 1)
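I can't fully test this at the moment because requests is hanging, but you need to extract .text from the elements that are returned. Also, your first selection (select) returns a list, so either switch to select_one or index into the list appropriately. I prefer CSS selectors over find.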
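To make the difference concrete, here is a small sketch on an inline HTML snippet. The markup is made up for illustration and is not copied from the site, and it uses the built-in html.parser so it runs without lxml. It also suggests where rows like `[],,,,` come from: csv.writer writes an empty list as `[]` and None as an empty field.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, only loosely modeled on an exhibitor detail page.
sample_html = """
<h2>Example Company</h2>
<span id="lblPhone">555-0100</span>
"""

page = BeautifulSoup(sample_html, "html.parser")

as_list = page.select("h2")          # always a list-like result, possibly empty
as_tag = page.select_one("h2")       # first matching element, or None
missing = page.select_one("#nope")   # no match, so this is None

print(len(as_list))                  # number of <h2> elements found
print(as_tag.text)                   # text content of the element
print(page.select_one("#lblPhone").text)
print(missing)
```

Calling .text on a missing element raises AttributeError, so it is worth checking for None before reading the text.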
I copied the HTML of a single detail page into a variable named html:
from bs4 import BeautifulSoup as bs

# html holds the HTML of one exhibitor detail page
page_soup = bs(html, 'lxml')
name = page_soup.select_one('h2').text
address = page_soup.select_one('#dnn_ctr8700_TRSExhibitorDetail_lblAddress').text
city = page_soup.select_one('#dnn_ctr8700_TRSExhibitorDetail_lblCityStateZip').text
phone = page_soup.select_one('#dnn_ctr8700_TRSExhibitorDetail_lblPhone').text
website = page_soup.select_one('#dnn_ctr8700_TRSExhibitorDetail_hlURL').text
print([name, address, city, phone, website])
With the changes above, the HTML copied from the first two links produces
['A-1 Scaffold Manufacturing', '590 Commerce Pkwy', 'Hays, KS', '785-621-5121', 'www.a1scaffoldmfg.com']
['Accella Tire Fill Systems', '2003 Curtain Pole Rd', 'Chattanooga, TN', '423-697-0400', 'www.accellatirefill.com']
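Putting the pieces together, the loop and the CSV header could be sketched roughly as follows. This is an untested sketch against the live site: the helper names parse_exhibitor and scrape_to_csv are hypothetical, html.parser is swapped in for lxml to avoid the extra dependency, and the element ids are the ones from the question. Note that it queries the page fetched for each URL, whereas the loop in the question parsed each page into soups but then searched page_soup, the main page.

```python
import csv
from urllib.request import urlopen

from bs4 import BeautifulSoup

# Column name -> CSS selector, using the element ids from the question.
FIELDS = {
    "name": "h2",
    "address": "#dnn_ctr8700_TRSExhibitorDetail_lblAddress",
    "city": "#dnn_ctr8700_TRSExhibitorDetail_lblCityStateZip",
    "phone": "#dnn_ctr8700_TRSExhibitorDetail_lblPhone",
    "website": "#dnn_ctr8700_TRSExhibitorDetail_hlURL",
}


def parse_exhibitor(html):
    """Return one dict for a detail page; missing elements become ''."""
    page = BeautifulSoup(html, "html.parser")
    row = {}
    for column, selector in FIELDS.items():
        element = page.select_one(selector)
        row[column] = element.get_text(strip=True) if element else ""
    return row


def scrape_to_csv(urls, path="output2.csv"):
    """Fetch each detail page and write all rows under a single header."""
    # Open the file once in write mode so the header appears exactly once.
    with open(path, "w", newline="", encoding="utf-8") as output_file:
        writer = csv.DictWriter(output_file, fieldnames=list(FIELDS))
        writer.writeheader()
        for url in urls:
            with urlopen(url) as response:
                writer.writerow(parse_exhibitor(response.read()))
```

With the urls list built in the question, scrape_to_csv(urls) would write one row per link. Opening the file once, outside the loop, also avoids appending duplicate rows on every rerun.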