I am trying to scrape the '.xlsx' files from the Tax Foundation website. Unfortunately, I keep getting this error message when I open a downloaded file: Excel cannot open the file '2017-FF-For-Website-7-10-2017.xlsx' because the file format or file extension is not valid. Verify that the file has not been corrupted and that the file extension matches the format of the file.
I did some research, and the suggested fix was to change the file extension to '.xls' instead of '.xlsx'. Can anyone help?
from bs4 import BeautifulSoup
import urllib.request
import os

url = urllib.request.urlopen("https://taxfoundation.org/facts-figures-2017/")
soup = BeautifulSoup(url, from_encoding=url.info().get_param('charset'))
FHFA = os.chdir('C:/US_Census/Directory')

seen = set()
for link in soup.find_all('a', href=True):
    href = link.get('href')
    if not any(href.endswith(x) for x in ['.xlsx']):
        continue

    file = href.split('/')[-1]
    filename = file.rsplit('.', 1)[0]
    if filename not in seen:  # only retrieve file if it has not been seen before
        seen.add(filename)  # add the file to the set
        url = urllib.request.urlretrieve('https://taxfoundation.org/' + href, file)
        print(filename)

print(' ')
print("All files successfully downloaded.")
P.S. I know the file can simply be downloaded by hand, but I am scraping it in order to automate a specific process.
Answer 0 (score: 2)
Your problem is with the url = urllib.request.urlretrieve('https://taxfoundation.org/' + href, file) line. If you go to the website and hover over the Excel download button, you will see that it points to a longer link, https://files.taxfoundation.org/20170710170238/2017-FF-For-Website-7-10-2017.xlsx (note the 2017....238 part?). That href is already a complete URL on a different host, so prepending 'https://taxfoundation.org/' produces an address that does not point at the workbook. You were never actually downloading the Excel file, which is why Excel rejects what was saved. Here is the corrected line:
url = urllib.request.urlretrieve(href, file)
Everything else works fine.
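If you want the scraper to cope with both absolute and relative links, one option is to resolve every href against the page URL with urllib.parse.urljoin instead of concatenating strings, and to sanity-check each download before trusting it. The sketch below is only an illustration under a few assumptions: the helper names resolve_link and looks_like_xlsx are made up for this example, the relative path is a placeholder, and the check relies on the fact that .xlsx workbooks are ZIP archives, so it only tells you the bytes are a ZIP file, not that they are a valid workbook.

```python
import io
import zipfile
from urllib.parse import urljoin

PAGE_URL = "https://taxfoundation.org/facts-figures-2017/"


def resolve_link(page_url, href):
    """Return an absolute URL for href (hypothetical helper).

    urljoin leaves an already-absolute href untouched and resolves a
    relative one against the page it was found on.
    """
    return urljoin(page_url, href)


def looks_like_xlsx(data):
    """Rough check (hypothetical helper): is this byte string a ZIP archive?

    An HTML error page saved under an .xlsx name fails this test.
    """
    return zipfile.is_zipfile(io.BytesIO(data))


# The absolute link from the download button is passed through unchanged.
absolute = "https://files.taxfoundation.org/20170710170238/2017-FF-For-Website-7-10-2017.xlsx"
print(resolve_link(PAGE_URL, absolute))

# A relative link (placeholder path) is resolved against the site root.
print(resolve_link(PAGE_URL, "/some/relative/file.xlsx"))

# An HTML response is not a ZIP archive, so this prints False.
print(looks_like_xlsx(b"<html>Not Found</html>"))
```

In the original loop this would mean calling urlretrieve(resolve_link(page_url, href), file), then reading the saved file back and running it through looks_like_xlsx before reporting success.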