I have a list of websites in a csv, and I want to grab every pdf linked from that list. BeautifulSoup's select works fine on <a href> links, but one site starts its pdf links with <data-url="https://example.org/abc/qwe.pdf">, and soup finds nothing there.
Is there any code I can use to grab everything that starts with "data-url" and ends with .pdf?
Apologies for the messy code. I'm still learning. Please let me know if I can clarify anything.
Thanks :D
The csv looks like this:
123456789 https://example.com
234567891 https://example2.com
import os
import csv
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Read the csv into (id, url) tuples
with open('links.csv') as f:
    urls = [tuple(line) for line in csv.reader(f)]
print(urls)

# If there is no such folder, the script will create one automatically
folder_location = r'C:\webscrapping'
if not os.path.exists(folder_location):
    os.mkdir(folder_location)

def url_response(url, final):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    for i, link in enumerate(soup.select("a[href$='.pdf']")):
        # Translate the captured URL into a local path
        filename = os.path.join(folder_location, link['href'].split('/')[-1])
        print(filename)
        # Write the file to that path
        with open(filename, 'wb') as f:
            f.write(requests.get(urljoin(url, link['href'])).content)
        # Rename the file to <id>_<n>.pdf inside the same folder
        os.rename(filename, os.path.join(folder_location, str(final) + "_" + str(i) + ".pdf"))

# Loop over the csv rows
for a, b in urls:
    url_response(b, a)
Answer 0 (score: 0)
If beautifulsoup can't help you, a regex solution to find the links is as follows:
Sample HTML:
txt = """
<html>
<body>
<p>
<data-url="https://example.org/abc/qwe.pdf">
</p>
<p>
<data-url="https://example.org/def/qwe.pdf">
</p>
</html>
"""
Regex code to extract the links inside data-url:
import re

re1 = '(<data-url=")'  ## STARTS WITH
re2 = '((?:http|https)(?::\\/{2}[\\w]+)(?:[\\/|\\.]?)(?:[^\\s"]*))'  # HTTP URL
re3 = '(">)'  ## ENDS WITH
rg = re.compile(re1 + re2 + re3, re.IGNORECASE | re.DOTALL)
links = re.findall(rg, txt)
for link in links:
    print(link[1])
Output:
https://example.org/abc/qwe.pdf
https://example.org/def/qwe.pdf
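For the question's use case, a narrower variant of the pattern above keeps only links that end in .pdf and derives a local filename from each URL path, the way the question's loop does. This is a sketch; the actual download step with requests.get is left out:

```python
import os
import re
from urllib.parse import urlparse

txt = """
<data-url="https://example.org/abc/qwe.pdf">
<data-url="https://example.org/def/qwe.pdf">
"""

# Narrowed pattern: only http(s) links that end in .pdf
rg = re.compile(r'<data-url="(https?://[^\s"]+\.pdf)">', re.IGNORECASE)
for url in rg.findall(txt):
    # Derive a local filename from the last segment of the URL path
    filename = os.path.basename(urlparse(url).path)
    print(url, "->", filename)
```

Each matched URL could then be fetched and written to folder_location exactly as in the question's code.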
Answer 1 (score: 0)
Yes, the attribute = value selector with the $ ends-with operator works here. data-url is just another attribute, like the href in your existing selector:
soup.select('[data-url$=".pdf"]')
Combined with the OR (comma) syntax:
soup.select('[href$=".pdf"],[data-url$=".pdf"]')
You can then test with has_attr to decide what to do with each retrieved element.
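A minimal sketch of that has_attr check, assuming data-url appears as an ordinary attribute on a tag (the HTML snippet and tag names here are made up for illustration):

```python
from bs4 import BeautifulSoup

# Made-up snippet mixing both link styles
html = '''
<a href="https://example.com/report.pdf">report</a>
<span data-url="https://example.org/abc/qwe.pdf"></span>
'''

soup = BeautifulSoup(html, "html.parser")
pdf_urls = []
for tag in soup.select('[href$=".pdf"],[data-url$=".pdf"]'):
    # has_attr tells us which attribute actually holds the link
    if tag.has_attr('href'):
        pdf_urls.append(tag['href'])
    else:
        pdf_urls.append(tag['data-url'])
print(pdf_urls)
```

The collected pdf_urls can then be passed to the same urljoin/requests.get download loop from the question.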