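I'm new to web scraping with Python and need help extracting the subcategory names (titles) and the page title (main category title), along with the URLs that my Python code scrapes. I tried .text with BeautifulSoup, but I think there may be a better way to do this, because I got an error and no output once I used it.
Any help would be appreciated. Please take a look at the code and help me get the output stored in a CSV file as URL \t subcategory title \t main category title.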
Example of the required output (subcategory URL, subcategory title, main category title):
http://www.medicalexpo.com/medical-manufacturer/neonatal-incubator-2963.html Neonatal incubators Pediatrics
http://www.medicalexpo.com/medical-manufacturer/infant-radiant-warmer-13522.html Infant radiant warmers Pediatrics
http://www.medicalexpo.com/medical-manufacturer/infant-phototherapy-lamp-44327.html Infant phototherapy lamps Pediatrics
Something like this.
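To make the target format concrete, here is a minimal sketch of how I imagine such rows could be written with Python's standard csv module and a tab delimiter. The helper name write_rows is just a placeholder of mine, and I'm assuming Python 3, where the built-in csv module handles Unicode text on its own.

```python
import csv
import io


def write_rows(rows, fileobj):
    """Write (url, subcategory_title, main_category_title) rows, tab-separated."""
    writer = csv.writer(fileobj, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)


# Sample rows copied from the example above
rows = [
    ("http://www.medicalexpo.com/medical-manufacturer/neonatal-incubator-2963.html",
     "Neonatal incubators", "Pediatrics"),
    ("http://www.medicalexpo.com/medical-manufacturer/infant-radiant-warmer-13522.html",
     "Infant radiant warmers", "Pediatrics"),
]

buffer = io.StringIO()  # in-memory stand-in for a real file
write_rows(rows, buffer)
print(buffer.getvalue())
```

To write to disk instead, the same function should work with a file opened as open("output.csv", "w", newline="", encoding="utf-8"), where output.csv is an arbitrary file name.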
Code:
from bs4 import BeautifulSoup
import requests
import unicodecsv
import time
import random


def get_soup(url):
    # Download the page and parse it with the lxml parser
    return BeautifulSoup(requests.get(url).content, "lxml")


url = 'http://www.medicalexpo.com/'
soup = get_soup(url)
# Links to the main categories on the home page
raw_categories = soup.select('div.univers-main li.category-group-item a')
print(raw_categories)

category_links = {}
for cat in raw_categories:
    # Time the request so the pause scales with how long the server took to respond
    t0 = time.time()
    soup = get_soup(cat['href'])
    response_delay = time.time() - t0
    time.sleep(10 * response_delay)
    time.sleep(random.randint(2, 5))
    links = soup.select('#category-group li a')
    # Key each list of subcategory URLs by the main category's link text
    category_links[cat.get_text(strip=True)] = [link['href'] for link in links]
print(category_links)
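For the titles themselves, I suspect the piece I'm missing is reading each link's text with get_text() alongside its href. Below is a minimal sketch run on a hard-coded HTML snippet. The markup is made up by me to imitate the '#category-group li a' structure, so the real page may well differ; extract_rows is a hypothetical helper name, and I use the built-in html.parser here so the sketch doesn't depend on lxml.

```python
from bs4 import BeautifulSoup

# Made-up markup that only imitates the '#category-group li a' structure;
# the real page may be laid out differently.
SAMPLE_HTML = """
<ul id="category-group">
  <li><a href="http://www.medicalexpo.com/medical-manufacturer/neonatal-incubator-2963.html"> Neonatal incubators </a></li>
  <li><a href="http://www.medicalexpo.com/medical-manufacturer/infant-radiant-warmer-13522.html">Infant radiant warmers</a></li>
</ul>
"""


def extract_rows(html, main_title):
    """Return (url, subcategory_title, main_category_title) tuples."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for link in soup.select("#category-group li a"):
        # get_text(strip=True) returns the link's visible text without padding
        rows.append((link["href"], link.get_text(strip=True), main_title))
    return rows


rows = extract_rows(SAMPLE_HTML, "Pediatrics")
for row in rows:
    print("\t".join(row))
```

In the real loop, the main category title would presumably come from the category link's own text, and the HTML from the downloaded category page.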