I built a scraper for a page that has only one table, with the columns already set up and so on. Very simple. This site has 3 different tables, broken up into what seem like random cells. I only need the information from the first table. I've created a list of the information I need, but I'm not sure how to organize it and get it running by pulling the URLs from a CSV file.
If I strip it down to a single URL, I can print the information from the licence. But I can't get it to work across multiple URLs. I feel like I'm completely overcomplicating things.
Here are some examples of the URLs I'm trying to run:
http://search.ccb.state.or.us/search/business_details.aspx?id=221851
http://search.ccb.state.or.us/search/business_details.aspx?id=221852
http://search.ccb.state.or.us/search/business_details.aspx?id=221853
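The ids in those URLs are sequential, so for testing, an OR_urls.csv with one URL per row can be generated with a short sketch like this (the range just covers the three example ids above):

import csv

# Build a test OR_urls.csv containing one URL per row for ids 221851-221853.
base = 'http://search.ccb.state.or.us/search/business_details.aspx?id={}'
with open('OR_urls.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for business_id in range(221851, 221854):
        writer.writerow([base.format(business_id)])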
The code is all messed up, but this is what I have:
I really appreciate any help you can offer.
import csv
from urllib.request import urlopen
import pandas as pd
from bs4 import BeautifulSoup as BS
from email import encoders
import time
import os
import smtplib
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
from email.mime.base import MIMEBase
def get_page():
    contents = []
    with open('OR_urls.csv','r') as csvf:
        urls = 'csv.reader(csvf)'
        r = requests.get(url)

data = {}
data['biz_info_object'] = soup(id='MainContent_contractornamelabel')[0].text.strip()
data['lic_number_object'] = soup(id='MainContent_licenselabel')[0].text.strip()
data['lic_date_object'] = soup(id='MainContent_datefirstlabel')[0].text.strip()
data['lic_status_object'] = soup(id='MainContent_licensestatuslabel')[0].text.strip()
data['lic_exp_object'] = soup(id='MainContent_licenseexpirelabel')[0].text.strip()
data['biz_address_object'] = soup(id='MainContent_addresslabel')[0].text.strip()
data['biz_phone_object'] = soup(id='MainContent_phonelabel')[0].text.strip()
data['biz_address_object'] = soup(id='MainContent_endorsementlabel')[0].text.strip()

with open('OR_urls.csv','r') as csvf:  # Open file in read mode
    urls = csv.reader(csvf)
    for url in urls:
        page = ('get_page')
        df1 = pd.read_html(page)
Answer (score: 1)
As you say, you seem to have combined several different scripts. Hopefully the following helps you see the structure you need. I assume your OR_urls.csv file holds the URLs in its first column. The script reads one row at a time from the CSV file, fetches that web page with a requests.get() call, parses it with BeautifulSoup, and extracts your various elements from the page into a dictionary, which is then printed along with the URL.
from bs4 import BeautifulSoup
import requests
import csv
with open('OR_urls.csv') as f_input:
    csv_input = csv.reader(f_input)

    for url in csv_input:
        r = requests.get(url[0])  # Assume the URL is in the first column
        soup = BeautifulSoup(r.text, "html.parser")
        data = {}
        data['biz_info_object'] = soup.find(id='MainContent_contractornamelabel').get_text(strip=True)
        data['lic_number_object'] = soup.find(id='MainContent_licenselabel').get_text(strip=True)
        data['lic_date_object'] = soup.find(id='MainContent_datefirstlabel').get_text(strip=True)
        data['lic_status_object'] = soup.find(id='MainContent_licensestatuslabel').get_text(strip=True)
        data['lic_exp_object'] = soup.find(id='MainContent_licenseexpirelabel').get_text(strip=True)
        data['biz_address_object'] = soup.find(id='MainContent_addresslabel').get_text(strip=True)
        data['biz_phone_object'] = soup.find(id='MainContent_phonelabel').get_text(strip=True)
        data['biz_address_object'] = soup.find(id='MainContent_endorsementlabel').get_text(strip=True)  # reuses the key above, so the endorsement overwrites the address (as seen in the output below)
        print(url[0], data)
This gives you the following output:
http://search.ccb.state.or.us/search/business_details.aspx?id=221851 {'biz_info_object': 'ANDREW LLOYD PARRY', 'lic_number_object': '221851', 'lic_date_object': '7/17/2018', 'lic_status_object': 'Active', 'lic_exp_object': '7/17/2020', 'biz_address_object': 'Residential General Contractor', 'biz_phone_object': '(802) 779-7180'}
http://search.ccb.state.or.us/search/business_details.aspx?id=221852 {'biz_info_object': 'SHANE MICHAEL DALLMAN', 'lic_number_object': '221852', 'lic_date_object': '7/17/2018', 'lic_status_object': 'Active', 'lic_exp_object': '7/17/2020', 'biz_address_object': 'Residential General Contractor', 'biz_phone_object': '(503) 933-5406'}
http://search.ccb.state.or.us/search/business_details.aspx?id=221853 {'biz_info_object': 'INTEGRITY HOMES NW INC', 'lic_number_object': '221853', 'lic_date_object': '7/24/2018', 'lic_status_object': 'Active', 'lic_exp_object': '7/24/2020', 'biz_address_object': 'Residential General Contractor', 'biz_phone_object': '(503) 522-6055'}
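One caveat: soup.find() returns None when a page doesn't contain a given id (for example, if an id has no matching licence record), so the lookups above would raise an AttributeError on such a page. A minimal defensive sketch, using a hypothetical get_label() helper plus a short pause to be polite to the server:

from bs4 import BeautifulSoup
import requests
import time

def get_label(soup, element_id):
    # Hypothetical helper: return the stripped text for an id, or '' if the id is absent.
    element = soup.find(id=element_id)
    return element.get_text(strip=True) if element is not None else ''

r = requests.get('http://search.ccb.state.or.us/search/business_details.aspx?id=221851')
soup = BeautifulSoup(r.text, 'html.parser')
print(get_label(soup, 'MainContent_licenselabel'))  # licence number, or '' if the label is missing
time.sleep(1)  # small delay before making the next request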
You could improve this further by creating a single list of all the IDs you need and building the dictionary with a dictionary comprehension. A csv.DictWriter() can then be used to write the data to a CSV file. Note that the last entry is renamed to biz_endorsement_object so the endorsement no longer overwrites the address:
from bs4 import BeautifulSoup
import requests
import csv
objects = (
    ('biz_info_object', 'MainContent_contractornamelabel'),
    ('lic_number_object', 'MainContent_licenselabel'),
    ('lic_date_object', 'MainContent_datefirstlabel'),
    ('lic_status_object', 'MainContent_licensestatuslabel'),
    ('lic_exp_object', 'MainContent_licenseexpirelabel'),
    ('biz_address_object', 'MainContent_addresslabel'),
    ('biz_phone_object', 'MainContent_phonelabel'),
    ('biz_endorsement_object', 'MainContent_endorsementlabel'),  # renamed from 'biz_address_object' to avoid the duplicate key
)

with open('OR_urls.csv') as f_input, open('output.csv', 'w', newline='') as f_output:
    csv_input = csv.reader(f_input)
    csv_output = csv.DictWriter(f_output, fieldnames=[name for name, id in objects])
    csv_output.writeheader()

    for url in csv_input:
        r = requests.get(url[0])  # Assume the URL is in the first column
        soup = BeautifulSoup(r.text, "html.parser")
        data = {name: soup.find(id=id).get_text(strip=True) for name, id in objects}
        csv_output.writerow(data)
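If you also want to record which page each row came from, the source URL can be added to the row dictionary. A sketch of that change, reusing objects, csv_input and f_output from the script above (the extra 'url' field name is my own choice):

# Drop-in replacement for the writer setup and loop above; adds the source URL as the first column.
csv_output = csv.DictWriter(f_output, fieldnames=['url'] + [name for name, id in objects])
csv_output.writeheader()

for url in csv_input:
    r = requests.get(url[0])
    soup = BeautifulSoup(r.text, "html.parser")
    data = {name: soup.find(id=id).get_text(strip=True) for name, id in objects}
    data['url'] = url[0]  # record the page this row came from
    csv_output.writerow(data)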