Question

I am trying to scrape an '.xlsx' file from the Tax Foundation website. Unfortunately, I keep getting this error message: Excel cannot open the file '2017-FF-For-Website-7-10-2017.xlsx' because the file format or file extension is not valid. Verify that the file has not been corrupted and that the file extension matches the format of the file. I did some research, and the suggested fix is to change the file extension to '.xls' instead of '.xlsx'. Can anyone help?

from bs4 import BeautifulSoup
import urllib.request
import os

url = urllib.request.urlopen("https://taxfoundation.org/facts-figures-2017/")

soup = BeautifulSoup(url, from_encoding=url.info().get_param('charset'))

FHFA = os.chdir('C:/US_Census/Directory')

seen = set()
for link in soup.find_all('a', href=True):
    href = link.get('href')
    if not any(href.endswith(x) for x in ['.xlsx']):
        continue

    file = href.split('/')[-1]
    filename = file.rsplit('.', 1)[0]
    if filename not in seen:  # only retrieve file if it has not been seen before
        seen.add(filename)  # add the file to the set
        url = urllib.request.urlretrieve('https://taxfoundation.org/' + href, file)
    print(filename)

print(' ')
print("All files successfully downloaded.")

P.S. I know the file can be downloaded by hand, but I am scraping it in order to automate a specific process.

Answer 1

Your problem is with the url = urllib.request.urlretrieve('https://taxfoundation.org/' + href, file) line. If you go to the website and hover over the Excel download button, you will see that it points to a longer link, https://files.taxfoundation.org/20170710170238/2017-FF-For-Website-7-10-2017.xlsx (notice the 2017....238?). The href is already a complete URL on a different host, so prefixing it with 'https://taxfoundation.org/' produces an address that does not point to the file. You were never downloading the Excel file properly. Here is the correct line:

url = urllib.request.urlretrieve(href, file)

Everything else works fine.
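
If you want the script to cope with both absolute and relative links, and to notice when the download is not really a workbook, something along these lines might help. This is only a sketch: the helper names resolve_download_url and looks_like_xlsx are made up for illustration, and it relies on the fact that an .xlsx file is a ZIP archive, so a saved HTML error page will fail the check.

```python
import zipfile
from urllib.parse import urljoin

# Page the links were scraped from (taken from the question).
PAGE_URL = "https://taxfoundation.org/facts-figures-2017/"


def resolve_download_url(page_url, href):
    """Hypothetical helper: turn an href into an absolute URL.

    urljoin returns href unchanged when it is already absolute,
    and resolves it against page_url when it is relative.
    """
    return urljoin(page_url, href)


def looks_like_xlsx(path):
    """Hypothetical helper: rough sanity check on a downloaded file.

    An .xlsx workbook is a ZIP archive, so anything that is not a
    ZIP file (for example an HTML page) cannot be a valid workbook.
    """
    return zipfile.is_zipfile(path)


if __name__ == "__main__":
    href = ("https://files.taxfoundation.org/20170710170238/"
            "2017-FF-For-Website-7-10-2017.xlsx")
    # The absolute link from the page is passed through untouched.
    print(resolve_download_url(PAGE_URL, href))
```

In the loop from the question you would call urllib.request.urlretrieve(resolve_download_url(PAGE_URL, href), file) and then check looks_like_xlsx(file) before reporting success.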
