How do I create a CSV file for a web scraper?

Date: 2018-02-08 18:24:00

Tags: python pandas csv web-scraping beautifulsoup

I have a list of tuples that I need to write to a CSV file using pandas, but I don't know how. I tried to put them all into one, but that didn't work. This is the code that builds the lists I want to get into the CSV:

    tables = soup.find_all("div", {"class":"pane"})[0].find_all("table")

    if (len(tables) > 4):
        product_list = [
            (
                t[0].findAll("div", {"class":"headline"})[0].text.strip(), #title
                t[0].findAll("div", {"class":"copy"})[0].text.strip(),     #description
                t[1].text.strip(),                                         #product number
                t[2].text.strip(),                                         #category number
                t[3].text.strip()                                          #price
            )
            for t in (t.find_all('td') for t in tables[4].find_all('tr'))
            if t
        ]
    elif (len(tables) == 1):
        product_list = [
            (
                t[0].findAll("div", {"class":"catNo"})[0].text.strip(),    #catNo
                t[0].findAll("div", {"class":"headline"})[0].text.strip(), #headline
                t[0].findAll("div", {"class":"price"})[0].text.strip(),    #price
                t[0].findAll("div", {"class":"copy"})[0].text.strip()      #description
            )
            for t in (t.find_all('td') for t in tables[0].find_all('tr'))
            if t
        ]
    else:
        print("could not parse main product\n\n")
        time.sleep(timeDelay)

    print(product_list)
    time.sleep(timeDelay)

    if len(tables) > 5:
        add_product_list = [
            (
                t[0].findAll("div", {"class":"title"})[0].text.strip(), #title
                t[0].findAll("div", {"class":"copy"})[0].text.strip(),  #description
                t[1].text.strip(),                                      #product number
                t[2].text.strip(),                                      #category number
                t[3].text.strip()                                       #price
            )
            for t in (t.find_all('td') for t in tables[5].find_all('tr'))
            if t
        ]
        print(add_product_list)
        time.sleep(timeDelay)

I have imported pandas, but I don't know what to put into the DataFrame, because the items aren't individually named; they're all lumped together. Any help would be great, as this is one of the first scrapes I've ever done. Thanks!
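
For reference, a DataFrame can be built straight from a list of tuples by naming the columns; here is a minimal sketch (not from the original post, assuming product_list holds the 5-tuples built above, with column names guessed from the comments):

import pandas as pd

# Column names are assumptions taken from the inline comments above.
columns = ["title", "description", "product_number", "category_number", "price"]
df = pd.DataFrame(product_list, columns=columns)
df.to_csv("Qiagen_Scrape_final.csv", index=False)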

Here is also the first part of the script, with the HTML/URLs I'm scraping.

from bs4 import BeautifulSoup
import requests
import time
import random
import csv
import pandas as pd

filename = "Qiagen_Scrape_final.csv"
f = open(filename, "w")
headers = "product_name, product description, Cat No, product number, price\n"
f.write(headers)  # write the header variable, not the string literal 'headers'

product_urls =[
'https://www.qiagen.com/us/shop/pcr/primer-sets/miscript-precursor-assays/#orderinginformation', 
'https://www.qiagen.com/us/shop/pcr/primer-sets/miscript-primer-assay-plate/#orderinginformation',
]

1 Answer:

Answer 0 (score: 0)

Here is how you can fill a DataFrame with this data using dictionaries:

from bs4 import BeautifulSoup
import requests
import pandas as pd

product_urls = [
  'https://www.qiagen.com/us/shop/pcr/primer-sets/miscript-primer-assay-plate/#orderinginformation'
]

html = requests.get(product_urls[0]).text
soup = BeautifulSoup(html, 'lxml')

container = soup.find('table')
tables = container.find_all('tr')

dic_list = []
for t in tables[13:]:  # the first 13 rows on this page are not product rows
    data = t.find_all('td')
    dic = {}
    try:
        # No trailing commas here, otherwise each value becomes a 1-tuple.
        dic['title'] = data[0].find('div').text
        dic['description'] = data[0].find("div", {"class":"copy"}).text
        dic['prod_number'] = data[1].text
        dic['cat_number'] = data[2].text
        dic['price'] = data[3]['price']  # the price sits in an attribute of the cell
    except (IndexError, KeyError, AttributeError):
        continue  # skip rows that lack the expected cells
    dic_list.append(dic)


df = pd.DataFrame(dic_list)
print(df.sample(3))


    cat_number description price prod_number  \
25  MS00064316   (ZmU65-2)  93.1      218300   
13  MS00064232   (ZmU49-1)  93.1      218300   
26  MS00064323  (OssnoR28)  93.1      218300   

                                title  
25   Zm_U65-2_1 miScript Primer Assay  
13   Zm_U49-1_1 miScript Primer Assay  
26  Os_snoR28_1 miScript Primer Assay  

Finally, save your CSV file like this:

df.to_csv('sample_csv.csv', index=False)
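
If you want one CSV covering every URL in product_urls, a possible extension (a sketch, not part of the original answer; scrape_page is a hypothetical helper, and it reuses the imports and product_urls defined above) is to collect the row dicts per page and concatenate before writing:

def scrape_page(url):
    # Fetch one product page and return its rows as a list of dicts.
    html = requests.get(url).text
    soup = BeautifulSoup(html, 'lxml')
    rows = soup.find('table').find_all('tr')
    page_rows = []
    for t in rows[13:]:  # same row offset as above; a page-specific assumption
        data = t.find_all('td')
        try:
            page_rows.append({
                'title': data[0].find('div').text,
                'description': data[0].find("div", {"class":"copy"}).text,
                'prod_number': data[1].text,
                'cat_number': data[2].text,
                'price': data[3]['price'],
            })
        except (IndexError, KeyError, AttributeError):
            continue  # skip rows without the expected cells
    return page_rows

all_rows = []
for url in product_urls:
    all_rows.extend(scrape_page(url))

pd.DataFrame(all_rows).to_csv('sample_csv.csv', index=False)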