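I need help writing new code in my script that reads the scraped files and, if any of them share the same file name (ignoring the file type), removes the duplicates from the directory. Thanks in advance! Here is my code so far: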
from bs4 import BeautifulSoup
import urllib.request
import os
url = urllib.request.urlopen("https://www.fhfa.gov/DataTools/Downloads/Pages/House-Price-Index-Datasets.aspx#mpo")
soup = BeautifulSoup(url, from_encoding=url.info().get_param('charset'))
FHFA = os.chdir('C:/US_Census/Directory')
for link in soup.find_all('a', href=True):
    href = link.get('href')
    if not any(href.endswith(x) for x in ['.csv', '.xml', '.xls', '.xlsx', '.sql', '.txt', '.json']):
        continue
    filename = href.split('/')[-1]
    url = urllib.request.urlretrieve('https://www.fhfa.gov/' + href, filename)
    print(filename)
    print(' ')
print("All files successfully downloaded.")
Answer 0 (score: 1)
Your code retrieves file names such as:
HPI_master.csv
HPI_master.xml
HPI_master.sql
...
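Understandably, you only want the first one and want to discard the rest.
You can add a set to keep track of the file names that have already been seen: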
seen = set()
for link in soup.find_all('a', href=True):
    href = link.get('href')
    if not any(href.endswith(x) for x in ['.csv', '.xml', '.xls', '.xlsx', '.sql', '.txt', '.json']):
        continue
    file = href.split('/')[-1]
    filename = file.rsplit('.', 1)[0]
    if filename not in seen:  # only retrieve file if it has not been seen before
        seen.add(filename)  # add the file to the set
        url = urllib.request.urlretrieve('https://www.fhfa.gov/' + href, file)
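To check the de-duplication logic without hitting the network, the same idea can be pulled into a small function. This is only a sketch: the name `first_per_stem` and the sample hrefs are made up for illustration, and unlike the loop above it matches extensions case-insensitively.

```python
from posixpath import basename, splitext

# Extensions taken from the filter in the question.
EXTENSIONS = ('.csv', '.xml', '.xls', '.xlsx', '.sql', '.txt', '.json')


def first_per_stem(hrefs, extensions=EXTENSIONS):
    """Return the hrefs worth downloading: the first link seen for each base name."""
    seen = set()
    kept = []
    for href in hrefs:
        # str.endswith accepts a tuple of suffixes.
        if not href.lower().endswith(extensions):
            continue
        stem, _ext = splitext(basename(href))  # 'HPI_master.csv' -> 'HPI_master'
        if stem in seen:
            continue
        seen.add(stem)
        kept.append(href)
    return kept


# Hypothetical hrefs, for illustration only.
links = [
    '/DataTools/Downloads/Documents/HPI/HPI_master.csv',
    '/DataTools/Downloads/Documents/HPI/HPI_master.xml',
    '/DataTools/Downloads/Pages/House-Price-Index.aspx',
    '/DataTools/Downloads/Documents/HPI/HPI_AT_state.txt',
]
print(first_per_stem(links))
```

With these sample links only the `.csv` and `.txt` entries are kept; each returned href could then be passed to `urllib.request.urlretrieve` as in the loop above.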
Answer 1 (score: 0)
Add a set of the files that have already been seen:
seen = set()
for link in soup.find_all('a', href=True):
    href = link.get('href')
    splitted = href.replace('/','.').split('.')
    if len(splitted)<2:
        continue
    filename, fileext = splitted[-2:]
    if href.find('.')<0 or not (fileext.lower() in ['csv', 'xml', 'xls', 'xlsx', 'sql', 'txt', 'json']):
        continue
    if filename in seen:
        continue
    seen.add(filename)
    url = urllib.request.urlretrieve('https://www.fhfa.gov/' + href, filename+'.'+fileext)
    print(filename)
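If the files are already on disk, the question's literal request (removing the duplicates from the directory) could be handled afterwards. The following is a rough sketch, not a tested recipe for this site: `remove_duplicate_stems` is a hypothetical helper, it assumes "duplicate" means "same name before the last dot", and it keeps whichever file sorts first by name. It defaults to a dry run so nothing is deleted until the result has been checked.

```python
from pathlib import Path


def remove_duplicate_stems(directory, dry_run=True):
    """Keep one file per base name in `directory`.

    Returns the paths that were removed (or would be removed when dry_run is True).
    """
    seen = set()
    removed = []
    # Sort so that the surviving file is chosen deterministically.
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        if path.stem in seen:  # 'HPI_master.xml' -> stem 'HPI_master'
            removed.append(path)
            if not dry_run:
                path.unlink()
        else:
            seen.add(path.stem)
    return removed
```

For example, `remove_duplicate_stems('C:/US_Census/Directory')` would only list the candidates; passing `dry_run=False` actually deletes them.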