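I am trying to scrape certain PDFs from a local council website. I only want the ones for certain dates. Is it possible to search by link text?
For example, I only want the ones from certain months.
I have written code to find them, but it gives me this error:
TypeError: string indices must be integers
It is raised on the line that checks the text where the date is.
Here is my code: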
import requests
import urllib.request
import time
from bs4 import BeautifulSoup as bs

url = "https://www.gmcameetings.co.uk"
meeting_links = soup.find('a', {'href':"https://www.gmcameetings.co.uk/meetings/committee/36/economy_business_growth_and_skills_overview_and_scrutiny"})
f = open(r"E:\Internship\WORK\GMCA\Getting PDFS\gmcabusinessminutelinks.txt", "w+")

for link in meeting_links:
    if link['text'].find_all(["April 2018"],["May 2018"],["June 2018"],["July 2018"])>1:
        r2 = requests.get(link['href'])
        print("link1")
        page2 = r2.text
        soup2 = bs(page2, 'lxml')
        pdf_links = soup2.find_all('a', href=True)
        for plink in pdf_links:
            if plink['href'].find('minutes')>1:
                print("Minutes!")
                f.write(str(plink['href']) + ' ')
f.close()
Is it possible to do this, or is the problem in the way I have written it?
Answer 0 (score: 1)
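On the error itself: most likely it comes from looping over the result of soup.find(). find() returns a single Tag rather than a list, and iterating over a Tag walks its children. Text children are NavigableString objects, which behave like str, so link['text'] ends up indexing a string with a string key. Here is a minimal sketch that reproduces this; the HTML snippet and the helper index_by_key are made up for illustration and are not the real page or part of bs4.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the committee page; not the real markup.
html = '<a href="/meetings/36">Committee <b>meetings</b></a>'
soup = BeautifulSoup(html, "html.parser")

# find() returns one Tag (or None), not a list of links.
tag = soup.find("a", href=True)

# Iterating over a Tag yields its children; the first one here is a
# NavigableString, which is a subclass of str.
children = list(tag)
first_child = children[0]


def index_by_key(node):
    """Return the TypeError message raised by node['text'], or None if no error."""
    try:
        node["text"]
    except TypeError as exc:
        return str(exc)
    return None


error_message = index_by_key(first_child)  # a string indexed with a string key
link_text = tag.get_text()                 # the supported way to read a link's text
```

So the text of a link is read with get_text() (or .text), not with ['text'], and find_all() is the call that returns something you can loop over link by link.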
You can use the :contains pseudo-class, which is available with bs4 4.7.1.
import requests
from bs4 import BeautifulSoup as bs

dates = ['July 2019', 'December 2018']
r = requests.get('https://www.gmcameetings.co.uk/meetings/committee/36/economy_business_growth_and_skills_overview_and_scrutiny')
soup = bs(r.content, 'lxml')
links = []

for date in dates:
    # Collect the href of every <a> whose text contains this date
    l = [item['href'] for item in soup.select('a:contains("' + date + '")')]
    links.append(l)  # one sub-list per date
Flatten the list at the end:
final = [i for item in links for i in item]
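If you would rather not depend on the selector syntax (more recent Soup Sieve releases deprecate :contains in favour of :-soup-contains), the same filtering can be sketched with find_all() and a plain substring check. This is only an illustration: SAMPLE_HTML, the relative hrefs in it, and the function name links_for_dates are assumptions of mine, and the real page may be structured differently.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://www.gmcameetings.co.uk"

# Hypothetical markup for illustration only; not copied from the real site.
SAMPLE_HTML = """
<ul>
  <li><a href="/meetings/meeting/1/">13 July 2018</a></li>
  <li><a href="/meetings/meeting/2/">8 June 2018</a></li>
  <li><a href="/meetings/meeting/3/">9 March 2018</a></li>
  <li><a>April 2018 (no link yet)</a></li>
</ul>
"""


def links_for_dates(html, dates, base_url=BASE_URL):
    """Return absolute hrefs of <a> tags whose visible text mentions any of the dates."""
    soup = BeautifulSoup(html, "html.parser")
    matches = []
    # href=True skips anchors that have no href attribute at all
    for anchor in soup.find_all("a", href=True):
        text = anchor.get_text(strip=True)
        if any(date in text for date in dates):
            # Resolve relative links against the site root
            matches.append(urljoin(base_url, anchor["href"]))
    return matches


wanted = ["April 2018", "May 2018", "June 2018", "July 2018"]
meeting_links = links_for_dates(SAMPLE_HTML, wanted)
```

This builds a flat list directly, so no flattening step is needed. Against the live site you would pass r.text from requests.get(...) instead of SAMPLE_HTML, then request each returned link and look for the 'minutes' hrefs as in the question.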