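I am using Python with requests to collect data from a website. Ideally, I only need the latest "Banking" information, which is always at the top of the page.
The code I have so far does this, but it then keeps going and hits an "index out of range" error. I am not very comfortable with aspx pages; is it possible to collect only the data under the "Banking" heading?
Here is what I have so far: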
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'}
print('Scraping South Dakota Banking Activity Actions...')
url2 = 'https://dlr.sd.gov/banking/monthly_activity_reports/monthly_activity_reports.aspx'
r2 = requests.get(url2, headers=headers)
soup = BeautifulSoup(r2.text, 'html.parser')
mylist5 = []
for tr in soup.find_all('tr')[2:]:
    tds = tr.find_all('td')
    print(tds[0].text, tds[1].text)
Ideally, I would also like to slice the information so that I can show only the activity, or only the approval status, and so on.
Answer 0 (score: 0)
Same setup as before:
import requests
from bs4 import BeautifulSoup, Tag
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'}
url = 'https://dlr.sd.gov/banking/monthly_activity_reports/monthly_activity_reports.aspx'
print('Scraping South Dakota Banking Activity Actions...')
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.text, 'html.parser')
Inspecting the page source, we can find the id of the element we need (the one that holds the table of values).
banking = soup.find(id='secondarycontent')
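After that, we filter out the soup elements that are not tags (for example NavigableString or other node types). You can also see how to get the text here (for other options, check the Tag docs).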
blocks = [b for b in banking.table.contents if type(b) is Tag] # filter out NavigableString
texts = [b.text for b in blocks]
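As a minimal sketch of the filtering step above, the snippet below runs the same idea on a small inline table, using isinstance instead of a type comparison. The HTML and the id "demo" are made-up stand-ins for the real page. It also skips rows with fewer than two cells, on the assumption that single-cell heading rows are what trigger the IndexError in the question.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical stand-in for the real page: one heading row, two data rows.
html = """
<table id="demo">
<tr><td colspan="2">August 2019</td></tr>
<tr><td>Bank A</td><td>Approved</td></tr>
<tr><td>Bank B</td><td>Pending</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find(id="demo")

# .contents mixes Tag objects with NavigableString whitespace nodes;
# isinstance keeps only the tags.
rows = [node for node in table.contents if isinstance(node, Tag)]

pairs = []
for row in rows:
    cells = row.find_all("td")
    if len(cells) < 2:
        # A single-cell heading row has no cells[1], so indexing it would fail.
        continue
    pairs.append((cells[0].get_text(strip=True), cells[1].get_text(strip=True)))

print(pairs)
```

Checking the cell count before indexing keeps the loop from failing on heading or spacer rows, whatever their exact markup turns out to be on the live page.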
Now, if this is what you mean by "latest", you have to work out which month is the most recent one and which month comes right before it.
current_month_idx, last_month_idx = None, None
current_month, last_month = 'August 2019', 'July 2019' # can parse with datetime too
for i, b in enumerate(blocks):
    if current_month in b.text:
        current_month_idx = i
    elif last_month in b.text:
        last_month_idx = i
    if all(idx is not None for idx in (current_month_idx, last_month_idx)):
        break  # stop once both indices have been found
assert current_month_idx < last_month_idx
curr_month_blocks = [b for i, b in enumerate(blocks) if current_month_idx < i < last_month_idx]
curr_month_texts = [b.text for b in curr_month_blocks]
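The comment in the code above notes that the month labels can be derived with datetime instead of being hard-coded. A minimal sketch of that idea follows; month_labels is a hypothetical helper, and it assumes the page labels its sections as "<Month name> <Year>" in English and that the newest report matches the calendar month, which may not hold if publication lags.

```python
from datetime import date, timedelta


def month_labels(today):
    """Return labels like 'August 2019' for today's month and the month before."""
    first_of_month = today.replace(day=1)
    # Stepping back one day from the 1st always lands in the previous month,
    # which also handles the January -> December year rollover.
    last_of_previous = first_of_month - timedelta(days=1)
    # %B yields the English month name under the default C locale.
    return first_of_month.strftime("%B %Y"), last_of_previous.strftime("%B %Y")


current_month, last_month = month_labels(date(2019, 8, 15))
print(current_month, "|", last_month)
```

In a live script you would pass date.today() instead of a fixed date.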
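Answer 1 (score: 0)
With bs4 4.7.1+, you can use :contains to isolate the latest month by filtering out the months listed after it on the page. I explain the principle of using :not to filter out later general siblings in this SO answer.
In short: find the row containing "August 2019" (the month is determined dynamically) and grab it together with all of its following siblings; then find the row containing "July 2019" together with all of its following siblings, and remove the latter set from the former.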
import requests, re
from bs4 import BeautifulSoup as bs
import pandas as pd
r = requests.get('https://dlr.sd.gov/banking/monthly_activity_reports/monthly_activity_reports.aspx')
soup = bs(r.content, 'lxml')
months = [i.text for i in soup.select('[colspan="2"]:has(a)')][0::2]
latest_month = months[0]
next_month = months[1]
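# Note: newer soupsieve releases deprecate :contains in favor of :-soup-contains.
rows_of_interest = soup.select(f'tr:contains("{latest_month}"), tr:contains("{latest_month}") ~ tr:not(:contains("{next_month}"), :contains("{next_month}") ~ tr)')
results = []
for row in rows_of_interest:
    # Raw string so that \s reaches the regex engine without an invalid-escape warning.
    data = [re.sub(r'\xa0|\s{2,}', ' ', td.text) for td in row.select('td')]
    if len(data) == 1:
        data.extend([''])  # pad single-cell rows so every row has two columns
    results.append(data)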
df = pd.DataFrame(results)
print(df)
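To make the sibling-exclusion principle described above easier to follow, here is a small sketch on a made-up inline table (the HTML is hypothetical, not the real page). It uses :-soup-contains, which assumes soupsieve 2.1 or newer; on older releases the same selector should work with :contains.

```python
from bs4 import BeautifulSoup

# Hypothetical table: two month heading rows, each followed by data rows.
html = """
<table>
<tr><td colspan="2">August 2019</td></tr>
<tr><td>Bank A</td><td>Approved</td></tr>
<tr><td>Bank B</td><td>Pending</td></tr>
<tr><td colspan="2">July 2019</td></tr>
<tr><td>Bank C</td><td>Approved</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
latest, previous = "August 2019", "July 2019"

# Take the latest month's row plus every row after it, then exclude
# the previous month's row and every row that follows that one.
selector = (
    f'tr:-soup-contains("{latest}"), '
    f'tr:-soup-contains("{latest}") ~ tr:not('
    f':-soup-contains("{previous}"), :-soup-contains("{previous}") ~ tr)'
)

rows = soup.select(selector)
texts = [row.get_text(" ", strip=True) for row in rows]
print(texts)
```

Only the August heading and the two rows beneath it survive; the July heading and everything after it are removed by the :not clause.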