Question

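I am trying to download and stream the contents of specific zip files listed on a web page.

The web page lists labels and links to the zip files in a table, like this: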

Filename                   Flag  Link
testfile_20190725_csv.zip  Y     zip
testfile_20190725_xml.zip  Y     zip
testfile_20190724_csv.zip  Y     zip
testfile_20190724_xml.zip  Y     zip
testfile_20190723_csv.zip  Y     zip
testfile_20190723_xml.zip  Y     zip
(etc.)

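The word "zip" above is the link to the zip file. I want to download only the CSV zip files, and only the first x of them (say 7) in the order they appear on the page, and none of the XML zip files.

Here is a sample of the page's HTML: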

<tr>
 <td class="labelOptional_ind">
  testfile_20190725_csv.zip
 </td>
 <td class="labelOptional" width="15%">
  <div align="center">
  Y
  </div>
 </td>
 <td class="labelOptional" width="15%">
  <div align="center">
   <a href="/test1/servlets/mbDownload?doclookupId=671334586">
    zip
   </a>
  </div>
 </td>
</tr>
<tr>
 <td class="labelOptional_ind">
  testfile_20190725_xml.zip
 </td>
 <td class="labelOptional" width="15%">
  <div align="center">
  N
  </div>
 </td>
 <td class="labelOptional" width="15%">
  <div align="center">
   <a href="/test1/servlets/mbDownload?doclookupId=671190392">
    zip
   </a>
  </div>
 </td>
</tr>
<tr>
 <td class="labelOptional_ind">
  testfile_20190724_csv.zip
 </td>
 <td class="labelOptional" width="15%">
  <div align="center">

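I think I'm nearly there, but I need a little help. What I have been able to do so far:

1. Check whether a local download folder exists, and create it if it does not.
2. Set up BeautifulSoup, read all the main labels from the web page (the first column of the table), and read all the zip links, i.e. the "a hrefs".
3. For testing, manually set one variable to one of the labels and another to its corresponding zip file link, download the file, and stream the CSV contents of the zip file.

What I need help with: collecting all the main labels together with their corresponding links, then looping through them, skipping any XML labels/links, and downloading/streaming only the CSV labels/links.

Here is my code: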

# Read zip files from page, download file, extract and stream output
from io import BytesIO
from zipfile import ZipFile
import urllib.request
import os,sys,requests,csv
from bs4 import BeautifulSoup

# check for download directory existence; create if not there
if not os.path.isdir('f:\\temp\\downloaded'):
    os.makedirs('f:\\temp\\downloaded')

# Get labels and zip file download links
mainurl = "http://www.test.com/"
url = "http://www.test.com/thisapp/GetReports.do?Id=12331"

# get page and setup BeautifulSoup
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")

# Get all file labels and filter so only use CSVs
mainlabel = soup.find_all("td", {"class": "labelOptional_ind"})
for td in mainlabel:
    if "_csv" in td.text:
        print(td.text)

# Get all <a href> urls
for link in soup.find_all('a'):
    print(mainurl + link.get('href'))

# QUESTION: HOW CAN I LOOP THROUGH ALL FILE LABELS AND FIND ONLY THE
# CSV LABELS AND THEIR CORRESPONDING ZIP DOWNLOAD LINK, SKIPPING ANY
# XML LABELS/LINKS, THEN LOOP AND EXECUTE THE CODE BELOW FOR EACH, 
# REPLACING zipfilename WITH THE MAIN LABEL AND zipurl WITH THE ZIP 
# DOWNLOAD LINK?

# Test downloading and streaming
zipfilename = 'testfile_20190725_xml.zip'
zipurl = 'http://www.test.com/thisdownload/servlets/thisDownload?doclookupId=674992379'
outputFilename = "f:\\temp\\downloaded\\" + zipfilename

# Unzip and stream CSV file
url = urllib.request.urlopen(zipurl)
zippedData = url.read()

# Save zip file to disk
print ("Saving to ",outputFilename)
output = open(outputFilename,'wb')
output.write(zippedData)
output.close()

# Unzip and stream CSV file
with ZipFile(BytesIO(zippedData)) as my_zip_file:
    for contained_file in my_zip_file.namelist():
        with open(("unzipped_and_read_" + contained_file + ".file"), "wb") as output:
            for line in my_zip_file.open(contained_file).readlines():
                print(line)

Answer 1

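To get all the required links, you can use the find_all() method with a custom function. The function searches for <td> tags whose text ends with "csv.zip".

Here, data is the HTML snippet from the question: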

from bs4 import BeautifulSoup

soup = BeautifulSoup(data, 'html.parser')

for td in soup.find_all(lambda tag: tag.name=='td' and tag.text.strip().endswith('csv.zip')):
    link = td.find_next('a')
    print(td.get_text(strip=True), link['href'] if link else '')

Prints:

testfile_20190725_csv.zip /test1/servlets/mbDownload?doclookupId=671334586
testfile_20190724_csv.zip
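
The question also asks for only the first x CSV files and for full download URLs. A minimal sketch of how the lookup above might be extended follows; the function name first_csv_links and its parameters base_url and limit are hypothetical, not part of the original answer, and it assumes the href values are relative to the site root as in the sample HTML.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def first_csv_links(html, base_url, limit=7):
    """Return up to `limit` (label, absolute_url) pairs for CSV zip labels.

    Hypothetical helper: `base_url` and `limit` are illustrative parameters.
    """
    soup = BeautifulSoup(html, "html.parser")
    cells = soup.find_all(
        lambda tag: tag.name == "td"
        and tag.get_text(strip=True).endswith("csv.zip")
    )
    pairs = []
    for td in cells:
        if len(pairs) >= limit:
            break  # stop once the first `limit` CSV entries are collected
        # find_next() looks forward in the document, so if a row had no link
        # it would pick up the link of a later row.
        link = td.find_next("a")
        if link is None or not link.get("href"):
            continue  # label without a link, e.g. truncated HTML
        # urljoin avoids the doubled slash that plain string concatenation
        # gives when the base ends with "/" and the href starts with "/".
        pairs.append((td.get_text(strip=True), urljoin(base_url, link["href"])))
    return pairs
```

With the question's variables this could be called as first_csv_links(r.content, mainurl, limit=7).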

Answer 2

Instead of capturing two separate lists of labels and URLs, you can capture the whole row, check whether its label is a csv one, and then use the URL from that row to download.

# Using the class name to identify the correct labels
mainlabel = soup.find_all("td", {"class": "labelOptional_ind"})

# find the containing row <tr> for each label
fullrows = [label.find_parent('tr') for label in mainlabel]

Now you can test each row's label and download the file from the link in the same row.
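
One way to act on those rows is sketched below. It is an illustration, not the answerer's code: the helper names csv_targets, iter_csv_rows and download_and_stream, and the parameters base_url, folder and limit, are all hypothetical. It assumes each element of fullrows is a BeautifulSoup <tr> tag as built above, and that the files inside each archive are UTF-8 encoded CSV text.

```python
import csv
import io
import os
import urllib.request
from urllib.parse import urljoin
from zipfile import ZipFile


def csv_targets(fullrows, base_url, limit=7):
    """Yield (label, absolute_url) for up to `limit` rows labelled as CSV."""
    found = 0
    for row in fullrows:
        if found >= limit:
            break
        label_cell = row.find("td", {"class": "labelOptional_ind"})
        link = row.find("a")
        if label_cell is None or link is None:
            continue  # incomplete row, nothing to download
        label = label_cell.get_text(strip=True)
        if not label.endswith("_csv.zip"):
            continue  # skip the XML archives
        found += 1
        yield label, urljoin(base_url, link.get("href"))


def iter_csv_rows(zipped_data):
    """Yield (member_name, row) for every CSV row of every archive member."""
    with ZipFile(io.BytesIO(zipped_data)) as archive:
        for name in archive.namelist():
            with archive.open(name) as member:
                # Assumption: members are UTF-8 text; change if they are not.
                text = io.TextIOWrapper(member, encoding="utf-8", newline="")
                for row in csv.reader(text):
                    yield name, row


def download_and_stream(fullrows, base_url, folder, limit=7):
    """Save each CSV archive under `folder`, then print its rows."""
    os.makedirs(folder, exist_ok=True)
    for label, zipurl in csv_targets(fullrows, base_url, limit):
        with urllib.request.urlopen(zipurl) as response:
            zipped_data = response.read()
        with open(os.path.join(folder, label), "wb") as output:
            output.write(zipped_data)
        for name, row in iter_csv_rows(zipped_data):
            print(name, row)
```

With the question's variables, a call might look like download_and_stream(fullrows, mainurl, 'f:\\temp\\downloaded', limit=7). Taking the link from the same row keeps each label paired with its own URL even if some other row has no link.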

Python web scraping and downloading specific zip files in Windows
