Scraping a table in <div> format from a web page using Beautiful Soup

Date: 2018-07-06 13:35:07

Tags: html selenium web-scraping beautifulsoup scrapy

So I'm looking to scrape two tables (in different formats) from the website https://info.fsc.org/details.php?id=a0240000005sQjGAAU&type=certificate, after iterating over a list of license codes using the search bar. I haven't fully included the loop yet, but I've added it at the top for completeness.

My issue is that because the two tables I want (the Product Data and the Certificate Data) are in two different formats, I have to scrape them separately. As the Product Data is in the usual "tr" format on the web page, this part was easy and I've managed to extract it into a CSV file. What's harder is extracting the Certificate Data, since it's in "div" format.

I've managed to print the Certificate Data as a list of text using the class function, but I need to save it in tabular form in a CSV file. As you can see, I've tried several unsuccessful ways of converting it to CSV, but any suggestions would be greatly appreciated, thanks!! Also, any other general tips for improving my code would be really useful, as I'm new to web scraping.

#namelist = open('example.csv', newline='', delimiter = 'example')
#for name in namelist:
    #include all of the below

import requests
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome(executable_path="/Users/jamesozden/Downloads/chromedriver")
url = "https://info.fsc.org/certificate.php"
driver.get(url)

# Enter the license code into the search bar and submit.
search_bar = driver.find_element_by_xpath('//*[@id="code"]')
search_bar.send_keys("FSC-C001777")
search_bar.send_keys(Keys.RETURN)
new_url = driver.current_url

# Request the results page and parse it.
r = requests.get(new_url)
soup = BeautifulSoup(r.content, 'lxml')

# Product Data: a regular <table>, so pandas can read it directly.
table = soup.find_all('table')[0]
df, = pd.read_html(str(table))

# Certificate Data: the <div>-based block, currently only extracted as text.
certificate = soup.find(class_='certificatecl').text
##certificate1 = pd.read_html(str(certificate))

driver.quit()

df.to_csv("Product_Data.csv", index=False)
##certificate1.to_csv("Certificate_Data.csv", index=False)

#print(df[0].to_json(orient='records'))
print(certificate)

Output:

Status
Valid
First Issue Date
2009-04-01
Last Issue Date
2018-02-16
Expiry Date
2019-04-01
Standard
FSC-STD-40-004 V3-0

What I want, but I need it for hundreds or thousands of license codes (I just created this example manually in Excel):

[Desired output (screenshot)]
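
For reference, a minimal sketch of one way the alternating label/value lines above could be paired into a single CSV row (assuming the certificate text from the code above splits cleanly on newlines; the file name is just an example):

import csv

lines = [line.strip() for line in certificate.splitlines() if line.strip()]
labels = lines[0::2]   # 'Status', 'First Issue Date', ...
values = lines[1::2]   # 'Valid', '2009-04-01', ...

with open('Certificate_Data.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(labels)   # header row
    writer.writerow(values)   # one data row per license code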

EDIT

So while this now works for the Certificate Data, I also want to scrape the Product Data and output it to another .csv file. However, at the moment it only prints 5 copies of the Product Data for the final license code, which is not what I want.

New code:

import csv
import requests
import pandas as pd
from bs4 import BeautifulSoup

df = pd.read_csv("MS_License_Codes.csv")
codes = df["License Code"]

def get_data_by_code(code):
    data = [
        ('code', code),
        ('submit', 'Search'),
    ]

    response = requests.post('https://info.fsc.org/certificate.php', data=data)
    soup = BeautifulSoup(response.content, 'lxml')

    status = soup.find_all("label", string="Status")[0].find_next_sibling('div').text
    first_issue_date = soup.find_all("label", string="First Issue Date")[0].find_next_sibling('div').text
    last_issue_date = soup.find_all("label", string="Last Issue Date")[0].find_next_sibling('div').text
    expiry_date = soup.find_all("label", string="Expiry Date")[0].find_next_sibling('div').text
    standard = soup.find_all("label", string="Standard")[0].find_next_sibling('div').text


    return [code, status, first_issue_date, last_issue_date, expiry_date, standard]

# Just insert here output filename and codes to parse...
OUTPUT_FILE_NAME = 'Certificate_Data.csv'
#codes = ['C001777', 'C001777', 'C001777', 'C001777']


df3=pd.DataFrame()


with open(OUTPUT_FILE_NAME, 'w') as f:
    writer = csv.writer(f)
    for code in codes:
        print('Getting code# {}'.format(code))
        writer.writerow((get_data_by_code(code)))
        table = soup.find_all('table')[0] 
        df1, = pd.read_html(str(table))
        df3 = df3.append(df1) 

df3.to_csv('Product_Data.csv', index = False, encoding='utf-8')

2 answers:

Answer 0 (score: 0)

This is all you need. No chromedriver. And no pandas for the scraping; don't bother with it.

import requests
import csv
from bs4 import BeautifulSoup

# This is all what you need for your task. Really.
# No chromedriver. Don't use it for scraping. EVER.
# No pandas. Don't use it for writing csv. It's not what pandas was made for.

#Function to parse single data page based on single input code.
def get_data_by_code(code):

    # Parameters to build POST-request. 
    # "type" and "submit" params are static. "code" is your desired code to scrape.
    data = [
        ('type', 'certificate'),
        ('code', code),
        ('submit', 'Search'),
    ]

    # POST-request to gain page data.
    response = requests.post('https://info.fsc.org/certificate.php', data=data)
    # "soup" object to parse html data.
    soup = BeautifulSoup(response.content, 'lxml')

    # "status" variable. Contains first's found [LABEL tag, with text="Status"] following sibling DIV text. Which is status.
    status = soup.find_all("label", string="Status")[0].find_next_sibling('div').text
    # Same for issue dates... etc.
    first_issue_date = soup.find_all("label", string="First Issue Date")[0].find_next_sibling('div').text
    last_issue_date = soup.find_all("label", string="Last Issue Date")[0].find_next_sibling('div').text
    expiry_date = soup.find_all("label", string="Expiry Date")[0].find_next_sibling('div').text
    standard = soup.find_all("label", string="Standard")[0].find_next_sibling('div').text

    # Returning found data as list of values.
    return [response.url, status, first_issue_date, last_issue_date, expiry_date, standard]

# Just insert here output filename and codes to parse...
OUTPUT_FILE_NAME = 'output.csv'
codes = ['C001777', 'C001777', 'C001777', 'C001777']

with open(OUTPUT_FILE_NAME, 'w') as f:
    writer = csv.writer(f)
    for code in codes:
        print('Getting code# {}'.format(code))

        #Writing list of values to file as single row.
        writer.writerow((get_data_by_code(code)))
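
(Side note: opening the output file with open(OUTPUT_FILE_NAME, 'w', newline='') is worth doing, since on Windows csv.writer otherwise inserts blank lines between rows.)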

Everything here is simple. I'd suggest spending some time in the Network tab of the Chrome developer tools to get a better understanding of request forging, which is essential for scraping tasks.

Generally, you don't need to run Chrome to click the "Search" button; instead, you forge the request that this click would produce. The same goes for any form and any ajax call.
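
As a rough sketch of the same idea for an ajax call (the URL and parameters here are made up for illustration, not a real FSC endpoint), you simply replay the request you see in the Network tab with requests:

import requests

# Hypothetical ajax endpoint -- in practice you copy the real URL, parameters
# and headers from the request shown in the Network tab.
response = requests.get(
    'https://example.com/api/certificates',
    params={'code': 'C001777'},
    headers={'X-Requested-With': 'XMLHttpRequest'},
)
data = response.json()  # ajax endpoints typically return JSON
print(data)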

Answer 1 (score: 0)

Well... you should improve your skills (:

df3=pd.DataFrame()

with open(OUTPUT_FILE_NAME, 'w') as f:
    writer = csv.writer(f)
    for code in codes:
        print('Getting code# {}'.format(code))
        writer.writerow((get_data_by_code(code)))
        ### HERE'S THE PROBLEM:
        # "soup" variable is declared inside of "get_data_by_code" function.
        # So you can't use it in outer context.
        table = soup.find_all('table')[0]  # <--- you should move this line into the
        # definition of the "get_data_by_code" function and return its value somehow...
        df1, = pd.read_html(str(table))
        df3 = df3.append(df1) 

df3.to_csv('Product_Data.csv', index = False, encoding='utf-8')

Based on the example, you could return a dictionary of values from the "get_data_by_code" function:

def get_data_by_code(code):
    ...
    table = soup.find_all('table')[0]
    # "row" here is the list of values the function already builds.
    return dict(row=row, table=table)
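
The calling loop could then use that dictionary for both output files, roughly like this (a sketch reusing codes, OUTPUT_FILE_NAME and df3 from your code above):

with open(OUTPUT_FILE_NAME, 'w') as f:
    writer = csv.writer(f)
    for code in codes:
        print('Getting code# {}'.format(code))
        result = get_data_by_code(code)
        # Certificate data: one CSV row per license code.
        writer.writerow(result['row'])
        # Product data: parse the returned <table> and collect it.
        df1, = pd.read_html(str(result['table']))
        df3 = df3.append(df1)

df3.to_csv('Product_Data.csv', index=False, encoding='utf-8')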