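I can't generate a CSV file containing all the data from my scraper.
When I test a single item it works fine: the exported CSV has all the columns and one row with the corresponding values.
When I try to apply the CSV export to the whole script, it doesn't work at all.
Can anyone tell me what I'm doing wrong?
Here is the scraper: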
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re

baseUrl = 'https://www.ebay.com/str/suitcharityestbysaveasuit?_pgn=1'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'}

productLinks = []
for x in range(1,2):
    r = requests.get(f'https://www.ebay.com/str/suitcharityestbysaveasuit?_pgn={x}')
    soup = BeautifulSoup(r.content, 'lxml')
    productList = soup.find_all('li', class_='s-item')
    for item in productList:
        for link in item.find_all('a', href=True):
            productLinks.append(link['href'])

alldata = []
for link in productLinks:
    r = requests.get(link, headers=headers)
    soup = BeautifulSoup(r.content, 'lxml')
    data = {}
    data['Name'] = soup.find('h1', class_='it-ttl').text.strip("Details, about")
    try:
        data['Price'] = soup.find('span', class_='notranslate').text.strip("US, $")
    except:
        data['Price'] = 0
    try:
        data['ebayID'] = soup.find('div', class_='u-flL iti-act-num itm-num-txt').text
    except:
        data['ebayID'] = 0
    data['Color'] = soup.find('h2', itemprop='color').text
    data['Brand'] = soup.find('h2', itemprop='brand').text
    try:
        soup = BeautifulSoup(requests.get(link).content, 'html.parser')
        image = soup.select_one('[itemprop="image"]')['src'].replace('l300', 'l1600')
        data['image'] = image
    except:
        data['image'] = 'None'
    for label, value in zip(soup.select('td.attrLabels'), soup.select('td.attrLabels + td')):
        label = label.get_text(strip=True)
        label = label.rstrip(':').lower()
        value = value.get_text(strip=True)
        data[label] = value
    try:
        soup = BeautifulSoup(requests.get(soup.iframe['src']).content, 'html.parser')
        number = soup.find(text=lambda t: t.strip().startswith('Item no.')).find_next('div').get_text(strip=True)
        data['Item Number'] = number
    except:
        data['Item Number'] = 'none'

df = pd.DataFrame(alldata)
df.to_csv('data.csv')
Answer 0 (score: 0)
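I think this is because alldata is empty: you never fill it with the scraped data.
Try appending each data dict to alldata at the end of the loop body: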
    try:
        soup = BeautifulSoup(requests.get(soup.iframe['src']).content, 'html.parser')
        number = soup.find(text=lambda t: t.strip().startswith('Item no.')).find_next('div').get_text(strip=True)
        data['Item Number'] = number
    except:
        data['Item Number'] = 'none'
    alldata.append(data) # <= here

df = pd.DataFrame(alldata)
df.to_csv('data.csv')
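To see why the append matters, here is a minimal sketch that needs no network access. The two records are made-up stand-ins for the scraped `data` dicts, `rows_to_csv` is a hypothetical helper, and the CSV is written to an in-memory buffer instead of `data.csv`. It compares a list that is never filled with one that receives each record.

```python
import io

import pandas as pd


def rows_to_csv(rows):
    """Build a DataFrame from a list of dicts and return it with its CSV text."""
    df = pd.DataFrame(rows)
    buffer = io.StringIO()
    df.to_csv(buffer)
    return df, buffer.getvalue()


# Case 1: the list is never filled, as in the question.
empty_df, empty_csv = rows_to_csv([])

# Case 2: each record is appended inside the loop.
# The records below are invented for illustration only.
alldata = []
for record in (
    {"Name": "Blazer A", "Price": "55.99", "fit": "Athletic"},
    {"Name": "Blazer B", "Price": "92.12", "country": "Canada"},
):
    data = dict(record)
    alldata.append(data)

full_df, full_csv = rows_to_csv(alldata)

print(empty_df.shape)   # no rows and no columns
print(full_df.shape)    # one row per record, one column per distinct key
print(full_df)
```

Because the records do not all share the same keys, pandas creates one column per distinct key and leaves the missing cells as NaN, which `to_csv` writes as empty fields.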
Edit
I also noticed that your code produces duplicates in productLinks. To avoid making unnecessary requests, consider iterating over a set:
for link in set(productLinks):
    ...
    # to keep track of parsed links
    data["link"] = link
    alldata.append(data)
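One caveat: a set does not keep the order in which the links were found, so the rows in the CSV may come out in a different order on each run. If order matters, one alternative (not part of the code above) is `dict.fromkeys`, which keeps the first occurrence of each key in insertion order. A minimal sketch, with made-up URLs standing in for `productLinks` and a hypothetical helper name:

```python
def unique_in_order(links):
    """Drop exact duplicates while keeping the first occurrence of each link."""
    return list(dict.fromkeys(links))


# Invented example links; the real list comes from the store page.
productLinks = [
    "https://example.com/itm/1",
    "https://example.com/itm/2",
    "https://example.com/itm/1",
    "https://example.com/itm/3",
    "https://example.com/itm/2",
]

unique_links = unique_in_order(productLinks)
print(unique_links)
```

This only removes exact duplicates. Two URLs that point to the same item but differ in their query string would still be requested twice.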
Sample output:
Row 0
    Name: Canali Men's Plaid Brown Wool Blazer 42L $2,195
    Price: 55.99
    ebayID: 324240475008
    Color: Brown
    Brand: Canali
    image: https://i.ebayimg.com/images/g/mhEAAOSwMUpfEJ8O/s-l1600.jpg
    condition: Pre-owned:An item that has been used or worn previously. See the seller’s listing for full details anddescription of any imperfections.See all condition definitions- opens in a new window or tab...Read moreabout the condition
    size: 42
    fit: Athletic
    jacket/coat length: Long
    type: Blazer
    jacket cut: Single-Breasted
    color: Brown
    department: Men
    brand: Canali
    chest size: 42
    jacket front button style: Two-Button
    material: Wool
    jacket vent style: Double-Vented
    pattern: Plaid
    size type: Regular
    Item Number: LXW304-Julw4
    link: https://www.ebay.com/itm/Canali-Mens-Plaid-Brown-Wool-Blazer-42L-2-195/324240475008?hash=item4b7e3d0380:g:mhEAAOSwMUpfEJ8O
    country/region of manufacture:

Row 1
    Name: Brooks Brothers Men's Gray Plaid Wool Blazer 42R $2,795
    Price: 92.12
    ebayID: 224093561666
    Color: Gray
    Brand: Brooks Brothers
    image: https://i.ebayimg.com/images/g/E7YAAOSwCpRfEaso/s-l1600.jpg
    condition: Pre-owned:An item that has been used or worn previously. See the seller’s listing for full details anddescription of any imperfections.See all condition definitions- opens in a new window or tab...Read moreabout the condition
    size: 42
    fit: Regular
    jacket/coat length: Regular
    type: Blazer
    jacket cut:
    color: Gray
    department: Men
    brand: Brooks Brothers
    chest size: 42
    jacket front button style: Three-Button
    material: Wool
    jacket vent style: Double-Vented
    pattern: Plaid
    size type: Regular
    Item Number: LXW373-JULW3
    link: https://www.ebay.com/itm/Brooks-Brothers-Mens-Gray-Plaid-Wool-Blazer-42R-2-795/224093561666?hash=item342d046342:g:E7YAAOSwCpRfEaso
    country/region of manufacture: Canada