I'm trying to scrape PDFs from a website using BeautifulSoup and Selenium. I've tried using the find_all() function in different ways, but I'm not getting the results I need.
Basically, what I want to do is get the link for each quarter (e.g. Q4 2014 – Q3 2015) and country (Malaysia, Indonesia, etc.) so that I can sort the PDFs into a folder per quarter, with a sub-folder per country inside it.
Here is a snippet of the site's HTML:
</div><a class="accord-header accord-header-5049 accord-header-supply-chain-resources"><div>Supply Chain</div></a><div class="accord-body accord-body-5049 accord-body-supply-chain-resources" style="display: none;"><ul>
<li class="folder">
<div>Q4 2014 – Q3 2015</div>
<ul style="display: none;">
<li class="folder">
<div>Indonesia</div>
<ul style="display: none;">
<li class="document"><a href="/sustainability/wp-content/themes/wilmar/sustainability/assets/../downloads/wilmar/resource/Indonesia/MNA KualaTanjung_L1--160122.pdf">MNA KualaTanjung</a></li>
<li class="document"><a href="/sustainability/wp-content/themes/wilmar/sustainability/assets/../downloads/wilmar/resource/Indonesia/MNA Paya Pasir_L1 --160122.pdf">MNA Paya Pasir</a></li>
<li class="document"><a href="/sustainability/wp-content/themes/wilmar/sustainability/assets/../downloads/wilmar/resource/Indonesia/MNA Pulo Gadung_L1 --160122.pdf">MNA Pulo Gadung</a></li>
</ul>
</li>
<li class="folder">
<div>Malaysia</div>
<ul style="display: none;">
<li class="document"><a href="/sustainability/wp-content/themes/wilmar/sustainability/assets/../downloads/wilmar/resource/Malaysia/BEO_L1 -- 160122.pdf">BEO Bintulu</a></li>
<li class="document"><a href="/sustainability/wp-content/themes/wilmar/sustainability/assets/../downloads/wilmar/resource/Malaysia/LDEO_L1 -- 160122.pdf">LDEO Lahad Datu</a></li>
</ul>
</li>
<li class="folder">
<div>Destination Countries</div>
<ul style="display: none;">
<li class="document"><a href="/sustainability/wp-content/themes/wilmar/sustainability/assets/../downloads/wilmar/resource/Destination/Bangladesh_160122 -- new.pdf">Bangladesh</a></li>
<li class="document"><a href="/sustainability/wp-content/themes/wilmar/sustainability/assets/../downloads/wilmar/resource/Destination/China- Oleochemical_160122 -- new.pdf">China- Oleochemical</a></li>
<li class="document"><a href="/sustainability/wp-content/themes/wilmar/sustainability/assets/../downloads/wilmar/resource/Destination/China- Specialty Fats_160122 -- new.pdf">China- Specialty Fats</a></li>
<li class="document"><a href="/sustainability/wp-content/themes/wilmar/sustainability/assets/../downloads/wilmar/resource/Destination/Europe_Brake -- 160122 -- new.pdf">Europe_Brake</a></li>
<li class="document"><a href="/sustainability/wp-content/themes/wilmar/sustainability/assets/../downloads/wilmar/resource/Destination/Europe_Rotterdam -- 160122 -- new.pdf">Europe_Rotterdam</a></li>
</ul>
</li>
</ul>
</li>
<li class="folder">
<div>Q1 – Q4 2015</div>
<ul style="display: none;">
<li class="folder">
<div>Indonesia</div>
<ul style="display: none;">
<li class="document"><a href="/sustainability/wp-content/uploads/2016/09/160427_MNA-KTJ_L1.pdf">MNA KualaTanjung</a></li>
<li class="document"><a href="/sustainability/wp-content/uploads/2016/09/160427_MNA-PG_L1.pdf">MNA Paya Pasir</a></li>
<li class="document"><a href="/sustainability/wp-content/uploads/2016/09/160427_MNA-PPS_L1.pdf">MNA Pulo Gadung</a></li>
<li class="document"><a href="/sustainability/wp-content/uploads/2016/09/160427_MNS-BTG_L1.pdf">MNS Bitung</a></li>
</ul>
</li>
<li class="folder">
<div>Malaysia</div>
<ul style="display: none;">
<li class="document"><a href="/sustainability/wp-content/uploads/2016/09/160427_BEO_L1.pdf">BEO Bintulu</a></li>
<li class="document"><a href="/sustainability/wp-content/uploads/2016/09/160427_LDEO_L1.pdf">LDEO Lahad Datu</a></li>
<li class="document"><a href="/sustainability/wp-content/uploads/2016/09/160427_NatOleo_L1.pdf">NatOleo Pasir Gudang</a></li>
<li class="document"><a href="/sustainability/wp-content/uploads/2016/09/160427_PGEO-Lumut_L1.pdf">PGEO Lumut</a></li>
</ul>
</li>
<li class="folder">
<div>Destination Countries</div>
<ul style="display: none;">
<li class="document"><a href="/sustainability/wp-content/uploads/2016/09/160427_Bangladesh_L1.pdf">Bangladesh</a></li>
<li class="document"><a href="/sustainability/wp-content/uploads/2016/09/160427_China-Oleochemical_L1.pdf">China- Oleochemical</a></li>
<li class="document"><a href="/sustainability/wp-content/uploads/2016/09/160427_China-Specialty Fats_L1.pdf">China- Specialty Fats</a></li>
</ul>
</li>
</ul>
</li>
I can extract the URLs with the code below, but I can't distinguish them by quarter and country, so that doesn't help me much.
for li in soup.find_all(class_="document"):
    try:
        href = li.a.get('href')
        if re.search(match, href):
            links.append(href)
    except KeyError:
        pass
Answer 0 (score: 1)
The page loads its content with JavaScript, so to get around that I simply use Selenium to load the page and grab the HTML. I also modified the code to target only the Supply Chain section.
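If the accordion markup hasn't finished rendering by the time page_source is read, something along these lines could be used to wait for it first (just a sketch, not part of my code below; the class name is taken from the HTML snippet above):
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def wait_for_supply_chain(driver, timeout=15):
    """Return the page source once the Supply Chain accordion body is in the DOM."""
    WebDriverWait(driver, timeout).until(
        EC.presence_of_element_located(
            (By.CSS_SELECTOR, 'div.accord-body-supply-chain-resources')
        )
    )
    return driver.page_source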
Edit:
This new version keeps the same browser open, downloads the pdfs (into the download_dir set near the top) and moves them into the correct directory structure. The directory tree is created wherever you run this script from.
Since the site seems to have some anti-bot measures, I randomized the time.sleep between 3 and 9 seconds (easy to change). Another thing is that if the script stops for some reason, you should be able to pick up where the downloads left off: the code checks whether a file already exists in the correct directory and only downloads it if it isn't there yet. (A more robust alternative to the fixed sleep is sketched after the code.)
To save time (there seem to be 525 pdfs in total), I only downloaded the pdfs from the first quarter directory as a test, but let me know if anything errors out!
import os
import random
import shutil
import time
from collections import defaultdict
from urllib.parse import quote, urljoin

from bs4 import BeautifulSoup
from bs4.element import Tag
from selenium import webdriver

# Setup Chrome to download PDFs
download_dir = '/home/lettuce/Downloads'  # e.g. "D:\z_Temp\Wilmar_Traceability" on Windows, or "/usr/Public" on linux/*nix
options = webdriver.ChromeOptions()
profile = {
    # Disable Chrome's PDF Viewer so pdfs are downloaded instead of displayed
    "plugins.plugins_list": [{
        "enabled": False,
        "name": "Chrome PDF Viewer"
    }],
    "download.default_directory": download_dir,
    "download.extensions_to_open": "applications/pdf"
}
options.add_experimental_option("prefs", profile)
driver = webdriver.Chrome(chrome_options=options)

# Get page source of all PDF links
url = 'http://www.wilmar-international.com/sustainability/resource-library/'
driver.get(url)
page_html = driver.page_source

# Parse out PDF links and a structure for the folders
soup = BeautifulSoup(page_html, 'lxml')
supply_chain = soup.select_one(
    '#text-wrap-sub > div.sub_cont_left > div > div > div > '
    'div.accord-body.accord-body-5049.accord-body-supply-chain-resources > ul'
)
result = {}
for li in supply_chain:
    if isinstance(li, Tag):
        quarter = li.div.text
        documents = defaultdict(list)
        for folder in li.find_all('li', class_='folder'):
            country = folder.div.text
            for document in folder.find_all('li', class_="document"):
                documents[country].append(document.a['href'])
        result[quarter] = documents

supply_chain_dir = os.path.join(os.getcwd(), 'SupplyChain')
os.makedirs(supply_chain_dir, exist_ok=True)
for quarter, countries in result.items():
    # create quarter directory
    quarter_dir = os.path.join(supply_chain_dir, quarter)
    os.makedirs(quarter_dir, exist_ok=True)
    for country, documents in countries.items():
        # create country directory
        country_dir = os.path.join(quarter_dir, country)
        os.makedirs(country_dir, exist_ok=True)
        for document in documents:
            filename = document.split('/')[-1]
            if not os.path.exists(os.path.join(country_dir, filename)):
                # download pdf and move it to country directory
                driver.get(urljoin(url, quote(document)))
                time.sleep(random.randint(3, 9))
                shutil.move(
                    src=os.path.join(download_dir, filename),
                    dst=country_dir
                )

driver.quit()
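One more note: if a large pdf is still downloading when shutil.move runs, the fixed sleep may not be enough. A rough alternative (only a sketch, relying on Chrome writing in-progress downloads with a .crdownload suffix) is to poll the download folder until the file is complete:
import os
import time

def wait_for_download(download_dir, filename, timeout=60):
    """Poll until `filename` exists in download_dir and no partial
    .crdownload files remain, or give up after `timeout` seconds."""
    target = os.path.join(download_dir, filename)
    deadline = time.time() + timeout
    while time.time() < deadline:
        partials = [f for f in os.listdir(download_dir)
                    if f.endswith('.crdownload')]
        if os.path.exists(target) and not partials:
            return True
        time.sleep(1)
    return False

# e.g. call wait_for_download(download_dir, filename) in place of the
# time.sleep(random.randint(3, 9)) line above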