Using a file

Time: 2018-06-15 17:23:13

Tags: python web-scraping beautifulsoup python-requests

I am trying to extract data from a website, and with the code below I collect all URLs from the main categories and their subcategory links. I am now stuck on saving the extracted output with a line separator (putting each URL on its own line) to Medical.tsv.

I need some help. The code is below:

from bs4 import BeautifulSoup
import requests
import time
import random

def write_to_file(file, mode, data, newline=None, with_tab=None):
    # Write data to the given file; optionally join a list of pieces into one string and add a trailing newline.
    with open(file, mode, encoding='utf-8') as l:
        if with_tab == True:
            data = ''.join(data)
        if newline == True:
            data = data + '\n'
        l.write(data)

def get_soup(url):
    return BeautifulSoup(requests.get(url).content, "lxml")

url = 'http://www.medicalexpo.com/'
soup = get_soup(url)
raw_categories = soup.select('div.univers-main li.category-group-item a')
category_links = {}

for cat in raw_categories:
    t0 = time.time()
    soup = get_soup(cat['href'])
    response_delay = time.time() - t0
    # Wait 10x as long as the site took to respond, so the crawler backs off automatically if the site slows down.
    time.sleep(10 * response_delay)
    # Add a random pause of 2 to 5 seconds so the crawl looks more like a human visitor than a bot.
    time.sleep(random.randint(2, 5))
    links = soup.select('#category-group li a')
    category_links[cat.text] = [link['href'] for link in links]
    print(category_links)

1 Answer:

Answer 0 (score: 0)

You have a write_to_file function, but you never call it. The mode must be 'w' or 'w+' if you want to overwrite a file that already exists.
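
As a minimal sketch, the loop's results could be written out with the existing write_to_file function. The file name Medical.tsv and the tab-separated (category, url) layout are assumptions based on what the question hints at, not something it spells out:

# Sketch only: 'Medical.tsv' and the tab between category and URL are assumptions.
# Write a header once in 'w' mode (overwrites any existing file), then append one line per URL.
write_to_file('Medical.tsv', 'w', 'category\turl', newline=True)
for category, urls in category_links.items():
    for url in urls:
        # with_tab=True joins the list into a single string; newline=True ends it with '\n',
        # so every URL lands on its own line of the .tsv file.
        write_to_file('Medical.tsv', 'a', [category, '\t', url], newline=True, with_tab=True)

Opening and closing the file for every URL is simple but slow; collecting all lines first and writing them in one call would also work.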