I have a list of websites in a csv, and I want to grab every pdf linked from that list. BeautifulSoup's select works fine on <a href> links, but one site starts its pdf links with <data-url="https://example.org/abc/qwe.pdf">, and soup finds nothing there.
Is there any code I can use to grab everything that starts with "data-url" and ends with .pdf?
Apologies for the messy code. I'm still learning. Please let me know if I can clarify anything.
Thanks :D
The csv looks like this:
123456789 https://example.com
234567891 https://example2.com
import os
import csv
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Read the csv into (id, url) tuples
with open('links.csv') as f:
    urls = [tuple(line) for line in csv.reader(f)]
print(urls)

# If there is no such folder, the script will create one automatically
folder_location = r'C:\webscrapping'
if not os.path.exists(folder_location):
    os.mkdir(folder_location)

def url_response(url, final):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    for i, link in enumerate(soup.select("a[href$='.pdf']")):
        # Translate the captured URL into a local path
        filename = os.path.join(folder_location, link['href'].split('/')[-1])
        print(filename)
        # Write the file to that path
        with open(filename, 'wb') as f:
            f.write(requests.get(urljoin(url, link['href'])).content)
        # Rename the file to <id>_<n>.pdf inside the same folder
        os.rename(filename, os.path.join(folder_location, str(final) + "_" + str(i) + ".pdf"))

# Loop over the csv rows
for a, b in urls:
    url_response(b, a)
Answer 0 (score: 0)
If beautifulsoup can't help you, a regex solution to find the links is as follows:
Sample HTML:
txt = """
<html>
<body>
<p>
<data-url="https://example.org/abc/qwe.pdf">
</p>
<p>
<data-url="https://example.org/def/qwe.pdf">
</p>
</html>
"""
Regex code to extract the links inside data-url:
import re

re1 = '(<data-url=")'  ## STARTS WITH
re2 = '((?:http|https)(?::\\/{2}[\\w]+)(?:[\\/|\\.]?)(?:[^\\s"]*))'  # HTTP URL
re3 = '(">)'  ## ENDS WITH
rg = re.compile(re1 + re2 + re3, re.IGNORECASE | re.DOTALL)
links = re.findall(rg, txt)
for link in links:
    print(link[1])
Output:
https://example.org/abc/qwe.pdf
https://example.org/def/qwe.pdf
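For the question's use case, a narrower variant of the pattern above keeps only links that end in .pdf and derives a local filename from each URL path, the way the question's loop does. This is a sketch; the actual download step with requests.get is left out:

```python
import os
import re
from urllib.parse import urlparse

txt = """
<data-url="https://example.org/abc/qwe.pdf">
<data-url="https://example.org/def/qwe.pdf">
"""

# Narrowed pattern: only http(s) links that end in .pdf
rg = re.compile(r'<data-url="(https?://[^\s"]+\.pdf)">', re.IGNORECASE)
for url in rg.findall(txt):
    # Derive a local filename from the last segment of the URL path
    filename = os.path.basename(urlparse(url).path)
    print(url, "->", filename)
```

Each matched URL could then be fetched and written to folder_location exactly as in the question's code.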
Answer 1 (score: 0)
Yes, the attribute = value selector with the $ ends-with operator works here. data-url is just another attribute, like the href in your existing selector:
soup.select('[data-url$=".pdf"]')
Combined with the OR (comma) syntax:
soup.select('[href$=".pdf"],[data-url$=".pdf"]')
You can then test with has_attr to decide what to do with each retrieved element.
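A minimal sketch of that has_attr check, assuming data-url appears as an ordinary attribute on a tag (the HTML snippet and tag names here are made up for illustration):

```python
from bs4 import BeautifulSoup

# Made-up snippet mixing both link styles
html = '''
<a href="https://example.com/report.pdf">report</a>
<span data-url="https://example.org/abc/qwe.pdf"></span>
'''

soup = BeautifulSoup(html, "html.parser")
pdf_urls = []
for tag in soup.select('[href$=".pdf"],[data-url$=".pdf"]'):
    # has_attr tells us which attribute actually holds the link
    if tag.has_attr('href'):
        pdf_urls.append(tag['href'])
    else:
        pdf_urls.append(tag['data-url'])
print(pdf_urls)
```

The collected pdf_urls can then be passed to the same urljoin/requests.get download loop from the question.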