How to scrape only new links (since the last scrape) using Python

Date: 2019-04-10 17:32:16

Tags: python web-scraping beautifulsoup

I'm scraping and downloading links from a website, and the site is updated with new links every day. I'd like to set it up so that each time my code runs, it only scrapes/downloads the links added since the last time the program ran, instead of going through all of the links again.

I've tried appending previously scraped links to an empty list, and only executing the rest of the code (downloading and renaming the file) if the scraped link isn't found in the list. But it doesn't seem to work as intended: every time I run the code, it starts "from 0" and overwrites the previously downloaded files.

Should I try a different approach?

Here's my code (general suggestions on how to clean it up and improve it are also welcome):

import praw
import requests
from bs4 import BeautifulSoup
import urllib.request
from difflib import get_close_matches
import os

period = '2018 Q4'
url = 'https://old.reddit.com/r/test/comments/b71ug1/testpostr23432432/'
headers = {'User-Agent': 'Mozilla/5.0'}
page = requests.get(url, headers=headers)

#set soup
soup = BeautifulSoup(page.text, 'html.parser')
table = soup.find_all('table')[0]

#create list of desired file names from existing directory names
candidates = os.listdir('/Users/test/Desktop/Test')
#set directory to download scraped files to
downloads_folder = '/Users/test/Desktop/Python/project/downloaded_files/'

#create empty list of names
scraped_name_list = []

#scrape site for names and links
for anchor in table.findAll('a'):
    try:
        if not anchor:
            continue
        name = anchor.text
        letter_link = anchor['href']
        # if the name doesn't exist in the list of names: append it, download it, and rename it
        if name not in scraped_name_list:
            # append it to the name list
            scraped_name_list.append(name)
            # download it
            urllib.request.urlretrieve(letter_link, downloads_folder + period + " " + name + '.pdf')
            # rename it
            best_options = get_close_matches(name, candidates, n=1, cutoff=.33)
            try:
                if best_options:
                    name = downloads_folder + period + " " + name + ".pdf"
                    os.rename(name, downloads_folder + period + " " + best_options[0] + ".pdf")
            except:
                pass
    except:
        pass

1 Answer:

Answer 0 (score: 1)

Every time you run this, it recreates scraped_name_list as a new, empty list. What you need to do is save the list at the end of a run and then try to load it back in on any later run. The pickle library is great for this.

Instead of defining scraped_name_list = [], try something like this:

import pickle

try:
    with open('/path/to/your/stuff/scraped_name_list.lst', 'rb') as f:
        scraped_name_list = pickle.load(f)
except IOError:
    scraped_name_list = []

This will try to open your list, but if this is the first run (meaning the list doesn't exist yet), it will start with an empty list. Then at the end of your code, you just need to save the file so it can be used on any future run:

with open('/path/to/your/stuff/scraped_name_list.lst', 'wb') as f:
    pickle.dump(scraped_name_list, f)