I'm trying to scrape two tables (in different formats) from https://info.fsc.org/details.php?id=a0240000005sQjGAAU&type=certificate, after iterating over a list of license codes with the site's search bar. I haven't fully wired up the loop yet, but I've included it at the top for completeness.

My problem is that the two tables I want (Product Data and Certificate Data) come in two different formats, so I have to scrape them separately. The Product Data part is easy, since it uses regular 'tr' markup on the page, and I've already managed to extract that CSV. The harder part is the Certificate Data, which is laid out with 'div' elements.

I've managed to print the Certificate Data as a list of text using its class, but I need to save it in tabular form in a CSV file. As you can see, I've tried several unsuccessful ways of converting it to CSV, so any suggestions would be much appreciated. Also, any general tips for improving my code would be really useful, as I'm new to web scraping. Thanks!!
import requests
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

#namelist = open('example.csv', newline='')
#for name in namelist:
#    include all of the below

driver = webdriver.Chrome(executable_path="/Users/jamesozden/Downloads/chromedriver")
url = "https://info.fsc.org/certificate.php"
driver.get(url)

search_bar = driver.find_element_by_xpath('//*[@id="code"]')
search_bar.send_keys("FSC-C001777")
search_bar.send_keys(Keys.RETURN)
new_url = driver.current_url

r = requests.get(new_url)
soup = BeautifulSoup(r.content, 'lxml')

# Product Data: a regular HTML table, so pandas can read it directly.
table = soup.find_all('table')[0]
df, = pd.read_html(str(table))

# Certificate Data: div-based layout, so this only yields plain text.
certificate = soup.find(class_='certificatecl').text
##certificate1 = pd.read_html(str(certificate))

driver.quit()

df.to_csv("Product_Data.csv", index=False)
##certificate1.to_csv("Certificate_Data.csv", index=False)
#print(df[0].to_json(orient='records'))
print(certificate)
Output:
Status
Valid
First Issue Date
2009-04-01
Last Issue Date
2018-02-16
Expiry Date
2019-04-01
Standard
FSC-STD-40-004 V3-0
What I want, but for hundreds or thousands of license codes (I just created this example manually in Excel):
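In table form, that means one row per license code, with columns like Status, First Issue Date, Last Issue Date, Expiry Date and Standard. As a minimal sketch of how the printed text above could be reshaped into such a row, assuming the labels and values strictly alternate line by line (my illustration, not part of the original attempt):

import csv

lines = [l.strip() for l in certificate.strip().splitlines() if l.strip()]
labels = lines[0::2]   # Status, First Issue Date, Last Issue Date, ...
values = lines[1::2]   # Valid, 2009-04-01, 2018-02-16, ...

with open("Certificate_Data.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Code"] + labels)
    writer.writerow(["FSC-C001777"] + values)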
EDIT

So while this now works for the certificate data, I also want to scrape the product data and output it to another .csv file. At the moment, however, it only prints five copies of the product data for the final license code, which is not what I want.

New code:
import csv
import requests
import pandas as pd
from bs4 import BeautifulSoup

df = pd.read_csv("MS_License_Codes.csv")
codes = df["License Code"]

def get_data_by_code(code):
    data = [
        ('code', code),
        ('submit', 'Search'),
    ]

    response = requests.post('https://info.fsc.org/certificate.php', data=data)
    soup = BeautifulSoup(response.content, 'lxml')

    status = soup.find_all("label", string="Status")[0].find_next_sibling('div').text
    first_issue_date = soup.find_all("label", string="First Issue Date")[0].find_next_sibling('div').text
    last_issue_date = soup.find_all("label", string="Last Issue Date")[0].find_next_sibling('div').text
    expiry_date = soup.find_all("label", string="Expiry Date")[0].find_next_sibling('div').text
    standard = soup.find_all("label", string="Standard")[0].find_next_sibling('div').text

    return [code, status, first_issue_date, last_issue_date, expiry_date, standard]

# Just insert here output filename and codes to parse...
OUTPUT_FILE_NAME = 'Certificate_Data.csv'
#codes = ['C001777', 'C001777', 'C001777', 'C001777']

df3 = pd.DataFrame()
with open(OUTPUT_FILE_NAME, 'w') as f:
    writer = csv.writer(f)
    for code in codes:
        print('Getting code# {}'.format(code))
        writer.writerow(get_data_by_code(code))
        table = soup.find_all('table')[0]
        df1, = pd.read_html(str(table))
        df3 = df3.append(df1)
df3.to_csv('Product_Data.csv', index=False, encoding='utf-8')
Answer 0 (score: 0)

This is all you need. No chromedriver. And no pandas for the scraping itself; don't bother with it here.
import requests
import csv
from bs4 import BeautifulSoup

# This is all you need for your task. Really.
# No chromedriver. Don't use it for scraping. EVER.
# No pandas. Don't use it for writing csv. It's not what pandas was made for.

# Function to parse a single data page based on a single input code.
def get_data_by_code(code):
    # Parameters to build the POST request.
    # "type" and "submit" params are static. "code" is the code to scrape.
    data = [
        ('type', 'certificate'),
        ('code', code),
        ('submit', 'Search'),
    ]

    # POST request to obtain the page data.
    response = requests.post('https://info.fsc.org/certificate.php', data=data)

    # "soup" object to parse the html data.
    soup = BeautifulSoup(response.content, 'lxml')

    # "status" variable: the text of the DIV that follows the first
    # LABEL tag whose text is "Status". Which is the status.
    status = soup.find_all("label", string="Status")[0].find_next_sibling('div').text

    # Same for the issue dates... etc.
    first_issue_date = soup.find_all("label", string="First Issue Date")[0].find_next_sibling('div').text
    last_issue_date = soup.find_all("label", string="Last Issue Date")[0].find_next_sibling('div').text
    expiry_date = soup.find_all("label", string="Expiry Date")[0].find_next_sibling('div').text
    standard = soup.find_all("label", string="Standard")[0].find_next_sibling('div').text

    # Return the found data as a list of values.
    return [response.url, status, first_issue_date, last_issue_date, expiry_date, standard]

# Just insert here the output filename and the codes to parse...
OUTPUT_FILE_NAME = 'output.csv'
codes = ['C001777', 'C001777', 'C001777', 'C001777']

with open(OUTPUT_FILE_NAME, 'w') as f:
    writer = csv.writer(f)
    for code in codes:
        print('Getting code# {}'.format(code))
        # Write the list of values to the file as a single row.
        writer.writerow(get_data_by_code(code))
Everything here is straightforward. I suggest you spend some time in the Network tab of the Chrome developer tools to get a better feel for forging requests, which is essential for scraping tasks.

In general, you don't need to run Chrome and click a "Search" button: you just forge the request that the click would produce. The same goes for any form or ajax call.
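For instance, the search on this page boils down to a single POST with three form fields. Once you have copied them from the Network tab, the whole browser interaction collapses to one call (a sketch using the same fields as the code above):

import requests

# The same search the browser performs when you click "Search",
# reproduced as a plain POST (field names taken from the Network tab).
response = requests.post(
    'https://info.fsc.org/certificate.php',
    data={'type': 'certificate', 'code': 'C001777', 'submit': 'Search'},
)
print(response.url, response.status_code)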
Answer 1 (score: 0)

Well... you should improve your skills (:
df3 = pd.DataFrame()
with open(OUTPUT_FILE_NAME, 'w') as f:
    writer = csv.writer(f)
    for code in codes:
        print('Getting code# {}'.format(code))
        writer.writerow(get_data_by_code(code))

        ### HERE'S THE PROBLEM:
        # The "soup" variable is declared inside the "get_data_by_code" function,
        # so you can't use it in the outer context.
        table = soup.find_all('table')[0]  # <--- you should move this line into
        # the definition of "get_data_by_code" and return its value somehow...
        df1, = pd.read_html(str(table))
        df3 = df3.append(df1)
df3.to_csv('Product_Data.csv', index=False, encoding='utf-8')
Following the example, you could return a dict of values from the "get_data_by_code" function:
def get_data_by_code(code):
    ...
    table = soup.find_all('table')[0]
    # "row" is the list of values built above (url/code, status, dates, standard).
    return dict(row=row, table=table)
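On the caller side, the loop from the question could then unpack that dict, along these lines (my sketch, not part of the original answer):

df3 = pd.DataFrame()
with open(OUTPUT_FILE_NAME, 'w') as f:
    writer = csv.writer(f)
    for code in codes:
        result = get_data_by_code(code)
        # Certificate data: one CSV row per code.
        writer.writerow(result['row'])
        # Product data: parse the returned table and accumulate it.
        df1, = pd.read_html(str(result['table']))
        df3 = df3.append(df1)
df3.to_csv('Product_Data.csv', index=False, encoding='utf-8')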