Question

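I'm new to web scraping with Python and need help extracting the subcategory names (titles), the page title (the main category title), and the URLs being scraped. I tried .text with BeautifulSoup, but I ran into errors and got no output when I used it, so I suspect there is a better way to do this.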

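Any help would be appreciated. Please take a look at the code and help me store the output in a CSV file in the form URL \t subcategory title \t main category title.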

Example of the required output (subcategory URL, subcategory title, main category title):

http://www.medicalexpo.com/medical-manufacturer/neonatal-incubator-2963.html        Neonatal incubators        Pediatrics
http://www.medicalexpo.com/medical-manufacturer/infant-radiant-warmer-13522.html        Infant radiant warmers        Pediatrics
http://www.medicalexpo.com/medical-manufacturer/infant-phototherapy-lamp-44327.html        Infant phototherapy lamps        Pediatrics

Something like this.
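The tab-separated layout above could be produced with Python's built-in csv module. This is only a minimal sketch: the helper name write_rows and the in-memory buffer are assumptions made for illustration, and the rows are copied from the example above rather than scraped.

```python
import csv
import io

# Rows copied from the example above:
# (subcategory URL, subcategory title, main category title)
rows = [
    ("http://www.medicalexpo.com/medical-manufacturer/neonatal-incubator-2963.html",
     "Neonatal incubators", "Pediatrics"),
    ("http://www.medicalexpo.com/medical-manufacturer/infant-radiant-warmer-13522.html",
     "Infant radiant warmers", "Pediatrics"),
    ("http://www.medicalexpo.com/medical-manufacturer/infant-phototherapy-lamp-44327.html",
     "Infant phototherapy lamps", "Pediatrics"),
]


def write_rows(rows, fileobj):
    """Write (url, subcategory, main category) tuples as tab-separated lines.

    write_rows is a hypothetical helper name, not part of any library.
    """
    writer = csv.writer(fileobj, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)


# An in-memory buffer stands in for a real file so the sketch has no side effects.
buffer = io.StringIO()
write_rows(rows, buffer)
print(buffer.getvalue())
```

To write to disk instead, the same helper could be given a file opened with open("subcategories.csv", "w", newline="", encoding="utf-8"), where the file name is just a placeholder.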

Code:

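from bs4 import BeautifulSoup
import requests
import unicodecsv
import time
import random

def get_soup(url):
    # Download the page and parse it with the lxml parser.
    return BeautifulSoup(requests.get(url).content, "lxml")

url = 'http://www.medicalexpo.com/'
soup = get_soup(url)
raw_categories = soup.select('div.univers-main li.category-group-item a')
print(raw_categories)
category_links = {}  # main category title -> list of subcategory URLs

for cat in raw_categories:
    # Time the request itself so the pause scales with the server's response time.
    t0 = time.time()
    soup = get_soup(cat['href'])
    response_delay = time.time() - t0
    time.sleep(10 * response_delay)
    time.sleep(random.randint(2, 5))
    links = soup.select('#category-group li a')

    # cat.links is not the link text: bs4 treats it as a search for a child
    # <links> tag and returns None, so use the anchor's visible text as the key.
    category_links[cat.get_text(strip=True)] = [link['href'] for link in links]
    print(category_links)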

Desired output: a CSV file containing the scraped data as text values.
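One way to get text values rather than tag objects is to call get_text(strip=True) on each anchor and pair it with its href. The sketch below is an illustration, not a tested scraper for the live site: SAMPLE_HTML is made-up markup shaped only to match the '#category-group li a' selector from the code above, extract_rows is a hypothetical helper name, and the main category title is passed in by the caller. It uses the standard-library html.parser so that lxml is not required.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical markup, written only to match the selector used in the question.
# The real page structure may differ.
SAMPLE_HTML = """
<ul id="category-group">
  <li><a href="/medical-manufacturer/neonatal-incubator-2963.html"> Neonatal incubators </a></li>
  <li><a href="/medical-manufacturer/infant-radiant-warmer-13522.html">Infant radiant warmers</a></li>
</ul>
"""


def extract_rows(html, base_url, main_title):
    """Return (absolute URL, subcategory title, main category title) tuples."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for link in soup.select("#category-group li a"):
        url = urljoin(base_url, link["href"])   # resolve relative hrefs
        title = link.get_text(strip=True)       # visible text, whitespace trimmed
        rows.append((url, title, main_title))
    return rows


rows = extract_rows(SAMPLE_HTML, "http://www.medicalexpo.com/", "Pediatrics")
for row in rows:
    print("\t".join(row))
```

In a loop over the main categories, main_title could be taken from the category link's own get_text(strip=True), and the resulting tuples written out with a tab-delimited CSV writer.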

0 answers: