I wrote this program to scrape the name, price, and shipping cost of every PS4 product on a newegg.com page. However, since there are multiple pages of PS4s, how can I add multiple links to the source variable? Basically, I want to scrape multiple pages on newegg.com (e.g. PS4 page #1, #2, #4, etc.).
from bs4 import BeautifulSoup
import requests
import csv
source = requests.get('https://www.newegg.com/PS4-Systems/SubCategory/ID-3102').text
soup = BeautifulSoup(source, 'lxml')
csv_file = open('newegg_scrape.csv', 'w')
csv_writer = csv.writer(csv_file)
csv_writer.writerow(['Product', 'Price', 'Shipping_info'])
for info in soup.find_all('div', class_='item-container'):
    prod = info.find('a', class_='item-title').text.strip()
    # The current price usually sits on the second line of the element;
    # fall back to the first line when no '$' is found there
    price = info.find('li', class_='price-current').text.strip().splitlines()[1].replace(u'\xa0', '')
    if u'$' not in price:
        price = info.find('li', class_='price-current').text.strip().splitlines()[0].replace(u'\xa0', '')
    ship = info.find('li', class_='price-ship').text.strip()
    print(prod)
    print(price)
    print(ship)
    csv_writer.writerow([prod, price, ship])
    # print(price.splitlines()[1])
    print('-----------')

csv_file.close()
Answer 0 (score: 0)
I don't use PHP, but I have done screen scraping with Perl in the past.
If you look near the bottom of the page, there is a button bar for the additional pages. You'll find that page 2 and the rest have URLs of the form https://www.newegg.com/PS4-Systems/SubCategory/ID-3102/Page-2?PageSize=36&order=BESTMATCH
Just set up a loop that constructs the URL, replacing Page-2 with Page-3, Page-4, and so on, then request, scrape, and repeat. I'd guess you keep going until you get no response or the page no longer contains the listings you're looking for. A rough sketch of that loop follows below.
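Something like the following, in Python rather than Perl (a minimal sketch, assuming Page-1 is a valid alias for the first page and that every listing sits in an item-container div as in your code):

import requests
from bs4 import BeautifulSoup

base_url = 'https://www.newegg.com/PS4-Systems/SubCategory/ID-3102'
page = 1
while True:
    # Build each page's URL: Page-1, Page-2, Page-3, ...
    source = requests.get(f'{base_url}/Page-{page}').text
    soup = BeautifulSoup(source, 'lxml')
    items = soup.find_all('div', class_='item-container')
    if not items:
        # No product containers found: assume we've run past the last page
        break
    for info in items:
        # Extract product, price, and shipping here as in the question
        pass
    page += 1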
Answer 1 (score: 0)
Grab the number of pages via its selector (from the first page you scrape), then loop over that count while including the page number in the source URL.
First page: 'https://www.newegg.com/PS4-Systems/SubCategory/ID-3102'

Selector for the page count: soup.find('div', class_='list-tool-pagination').find('strong').text.split('/')[1]

URL pattern for later pages: 'https://www.newegg.com/PS4-Systems/SubCategory/ID-3102/Page-' + page_number
from bs4 import BeautifulSoup
import requests
import csv
base_url = 'https://www.newegg.com/PS4-Systems/SubCategory/ID-3102'
# Grab the number of pages
def get_pages_number(soup):
    pages_number = soup.find('div', class_='list-tool-pagination').find('strong').text.split('/')[1]
    return int(pages_number)
# Your code + dynamic URL + return number of pages
def scrape_page(page_number=1):
    # Make the source "dynamic" based on the page number
    source = requests.get(f'{base_url}/Page-{page_number}').text
    soup = BeautifulSoup(source, 'lxml')
    # Soup processing goes here
    # You can use the code you posted to grab the price, etc...
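    # For example, reusing the extraction loop from the question (a sketch;
    # wire your csv_writer in here the same way as in your original script):
    for info in soup.find_all('div', class_='item-container'):
        prod = info.find('a', class_='item-title').text.strip()
        price = info.find('li', class_='price-current').text.strip().splitlines()[1].replace(u'\xa0', '')
        if u'$' not in price:
            price = info.find('li', class_='price-current').text.strip().splitlines()[0].replace(u'\xa0', '')
        ship = info.find('li', class_='price-ship').text.strip()
        print(prod, price, ship, sep='\n')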
    return get_pages_number(soup)
# Main function
if __name__ == '__main__':
    pages_number = scrape_page()
    # If there are more pages, we scrape them
    if pages_number > 1:
        for i in range(1, pages_number):
            scrape_page(i + 1)
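A closing note (my own suggestion, not part of the answer above): to get all pages into a single CSV as in your original script, create the writer once in the main block and give scrape_page access to it, for example by passing it as an extra parameter. A hypothetical variant of the main block:

if __name__ == '__main__':
    with open('newegg_scrape.csv', 'w', newline='') as csv_file:
        csv_writer = csv.writer(csv_file)
        csv_writer.writerow(['Product', 'Price', 'Shipping_info'])
        # In this variant, scrape_page would accept the writer as an argument
        pages_number = scrape_page(csv_writer)
        for i in range(2, pages_number + 1):
            scrape_page(csv_writer, i)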