Pandas DataFrame does not export correctly to Excel

Time: 2020-10-16 19:32:35

Tags: python pandas dataframe web-scraping

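I can't generate a CSV file containing all the data from my scraper.

When I test a single item it works: the exported CSV has all the columns and one row with the corresponding values.

When I try to export the CSV for the whole scraper run, it doesn't work at all.

Can someone tell me what I'm doing wrong?

Here is the scraper: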

import requests
from bs4 import BeautifulSoup
import pandas as pd
import re

baseUrl = 'https://www.ebay.com/str/suitcharityestbysaveasuit?_pgn=1'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'}

productLinks = []
for x in range(1,2):
    r = requests.get(f'https://www.ebay.com/str/suitcharityestbysaveasuit?_pgn={x}')

    soup = BeautifulSoup(r.content, 'lxml')

    productList = soup.find_all('li', class_='s-item')

    for item in productList:
        for link in item.find_all('a', href=True):
            productLinks.append(link['href'])


alldata = []
for link in productLinks:

    r = requests.get(link, headers=headers)

    soup = BeautifulSoup(r.content, 'lxml')

    data = {}
    data['Name'] = soup.find('h1', class_='it-ttl').text.strip("Details, about")
    try:
        data['Price'] = soup.find('span', class_='notranslate').text.strip("US, $")
    except:
        data['Price'] = 0

    try:
        data['ebayID'] = soup.find('div', class_='u-flL iti-act-num itm-num-txt').text
    except:
        data['ebayID'] = 0
    data['Color'] = soup.find('h2', itemprop='color').text
    data['Brand'] = soup.find('h2', itemprop='brand').text

    try:
        soup = BeautifulSoup(requests.get(link).content, 'html.parser')
        image = soup.select_one('[itemprop="image"]')['src'].replace('l300', 'l1600')
        data['image'] = image
    except:
        data['image'] = 'None'


    for label, value in zip(soup.select('td.attrLabels'), soup.select('td.attrLabels + td')):
        label = label.get_text(strip=True)
        label = label.rstrip(':').lower()
        value = value.get_text(strip=True)
        data[label] = value

    try:
        soup = BeautifulSoup(requests.get(soup.iframe['src']).content, 'html.parser')
        number = soup.find(text=lambda t: t.strip().startswith('Item no.')).find_next('div').get_text(strip=True)
        data['Item Number'] = number
    except:
        data['Item Number'] = 'none'

df = pd.DataFrame(alldata)
df.to_csv('data.csv')

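1 answer:

Answer 0 (score: 0)

I suspect this is because alldata is empty: you never populate it with the scraped data.

Try adding an append at the end of the loop body: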

    try:
        soup = BeautifulSoup(requests.get(soup.iframe['src']).content, 'html.parser')
        number = soup.find(text=lambda t: t.strip().startswith('Item no.')).find_next('div').get_text(strip=True)
        data['Item Number'] = number
    except:
        data['Item Number'] = 'none'
        
    alldata.append(data) # <= here

df = pd.DataFrame(alldata)
df.to_csv('data.csv')
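
To see the effect without hitting eBay, here is a minimal offline sketch. The `scraped` list is a hypothetical stand-in for the per-page `data` dicts, and the CSV is written to an in-memory buffer instead of `data.csv`; neither is part of the original scraper. It shows that an empty list gives an empty DataFrame, while appended dicts become rows, with columns taken from the union of their keys.

```python
import io

import pandas as pd

# Hypothetical stand-ins for the dicts built from two product pages.
scraped = [
    {"Name": "Blazer A", "Price": "55.99", "jacket cut": "Single-Breasted"},
    {"Name": "Blazer B", "Price": "92.12", "country/region of manufacture": "Canada"},
]

alldata = []

# The situation in the question: nothing was ever appended,
# so the DataFrame has no rows and no columns.
empty_df = pd.DataFrame(alldata)
print(empty_df.empty)

# The fix: append each dict once it has been filled in.
for data in scraped:
    alldata.append(data)

df = pd.DataFrame(alldata)

# Write to an in-memory buffer so the sketch leaves no file behind.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Keys that are missing from a given dict end up as NaN in the DataFrame and as empty fields in the CSV, which is why rows with different attribute tables can still share one file.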

Edit

I also noticed that your code produces duplicates in productLinks. To avoid making unnecessary requests, consider iterating over a set:

for link in set(productLinks):
    ...
    # to keep track of parsed links
    data["link"] = link
    alldata.append(data)

Sample output:

    Name    Price   ebayID  Color   Brand   image   condition   size    fit jacket/coat length  type    jacket cut  color   department  brand   chest size  jacket front button style   material    jacket vent style   pattern size type   Item Number link    country/region of manufacture
0    Canali Men's Plaid Brown Wool Blazer 42L $2,195    55.99   324240475008    Brown   Canali  https://i.ebayimg.com/images/g/mhEAAOSwMUpfEJ8O/s-l1600.jpg Pre-owned:An item that has been used or worn previously. See the seller’s listing for full details anddescription of any imperfections.See all condition definitions- opens in a new window or tab...Read moreabout the condition   42  Athletic    Long    Blazer  Single-Breasted Brown   Men Canali  42  Two-Button  Wool    Double-Vented   Plaid   Regular LXW304-Julw4    https://www.ebay.com/itm/Canali-Mens-Plaid-Brown-Wool-Blazer-42L-2-195/324240475008?hash=item4b7e3d0380:g:mhEAAOSwMUpfEJ8O  
1    Brooks Brothers Men's Gray Plaid Wool Blazer 42R $2,795    92.12   224093561666    Gray    Brooks Brothers https://i.ebayimg.com/images/g/E7YAAOSwCpRfEaso/s-l1600.jpg Pre-owned:An item that has been used or worn previously. See the seller’s listing for full details anddescription of any imperfections.See all condition definitions- opens in a new window or tab...Read moreabout the condition   42  Regular Regular Blazer      Gray    Men Brooks Brothers 42  Three-Button    Wool    Double-Vented   Plaid   Regular LXW373-JULW3    https://www.ebay.com/itm/Brooks-Brothers-Mens-Gray-Plaid-Wool-Blazer-42R-2-795/224093561666?hash=item342d046342:g:E7YAAOSwCpRfEaso  Canada
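
A side note on the set(productLinks) suggestion: a set removes duplicates but does not keep the links in the order they were scraped, so the rows in data.csv may not follow the order of the store page. If that order matters, one alternative is `dict.fromkeys`, which keeps the first occurrence of each key in insertion order. The sketch below assumes a hypothetical helper name (`unique_in_order`) and placeholder URLs rather than real listing links.

```python
def unique_in_order(links):
    """Return the links with duplicates removed, keeping first-seen order."""
    # dict keys are unique and preserve insertion order (Python 3.7+).
    return list(dict.fromkeys(links))


# Placeholder URLs standing in for the scraped productLinks list.
product_links = [
    "https://example.com/itm/1",
    "https://example.com/itm/2",
    "https://example.com/itm/1",
    "https://example.com/itm/3",
    "https://example.com/itm/2",
]

unique_links = unique_in_order(product_links)
print(unique_links)
```

In the scraper this would replace the loop header with `for link in unique_in_order(productLinks):`, leaving the rest of the loop unchanged.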