I'm trying to use BeautifulSoup to select the date, URL, description, and an additional URL from a table, and I can't access them because of some strange whitespace.

Here is what I have written so far:
import urllib.request
from bs4 import BeautifulSoup

def make_soup(url):
    thepage = urllib.request.urlopen(url)
    soupdata = BeautifulSoup(thepage, "html.parser")
    return soupdata

soup = make_soup('https://www.sec.gov/litigation/litreleases/litrelarchive/litarchive2010.shtml')
test1 = soup.findAll("td", {"nowrap": "nowrap"})
test2 = [item.text.strip() for item in test1]
Answer 0 (score: 0)
Unfortunately, there is no class or id HTML attribute that quickly identifies the table to scrape. After some experimentation, I found that it is the table at index 4.
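One way to verify that index, assuming the soup object built above, is to enumerate the tables and peek at the first row of each (a quick inspection sketch, not part of the scraper itself):

# Print each table's index with a preview of its first row so the
# right one can be identified by eye; index 4 held the releases when
# this was written, but it may shift if the page layout changes.
for idx, tbl in enumerate(soup.find_all('table')):
    first_row = tbl.find('tr')
    preview = first_row.get_text(strip=True)[:60] if first_row else '<no rows>'
    print(idx, preview)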
Next, we separate the header from the data rows and ignore it. The data still contains rows that only label each quarter; we can skip those with a try-except block, since they contain only a single table data tag.

I noticed that the descriptions are separated by tabs, so I split the text on \t.

For the URLs, I used .get('href') rather than ['href'] because, in my experience, not every anchor tag has an href attribute; this avoids an error in that case. Finally, the second anchor tag is not always present, so it is also wrapped in a try-except block.
data = []
table = soup.find_all('table')[4]  # target the specific table
header, *rows = table.find_all('tr')
for row in rows:
    try:
        litigation, date, complaint = row.find_all('td')
    except ValueError:
        continue  # ignore quarter rows
    release_id = litigation.text.strip().split('-')[-1]
    date = date.text.strip()
    desc = complaint.text.strip().split('\t')[0]
    lit_url = litigation.find('a').get('href')
    try:
        comp_url = complaint.find('a').get('href')
    except AttributeError:
        comp_url = None  # complaint url is optional
    info = dict(id=release_id, date=date, desc=desc, lit_url=lit_url, comp_url=comp_url)
    data.append(info)
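With data populated, a quick sanity check might look like this (field names as defined in the loop above):

# Peek at the first few scraped rows to confirm the fields parsed.
for entry in data[:3]:
    print(entry['id'], entry['date'], entry['lit_url'])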
Answer 1 (score: 0)
In bs4 4.7.1 you can combine :has and nth-of-type with next_sibling to get these columns:
from bs4 import BeautifulSoup
import requests, re

def make_soup(url):
    the_page = requests.get(url)
    soup_data = BeautifulSoup(the_page.content, "html.parser")
    return soup_data

soup = make_soup('https://www.sec.gov/litigation/litreleases/litrelarchive/litarchive2010.shtml')

releases = []
links = []
dates = []
descs = []
addit_urls = []

# Select first-column cells whose row links into /litigation/litreleases/,
# then walk the sibling cells for the date and description columns.
for i in soup.select('td:nth-of-type(1):has([href^="/litigation/litreleases/"])'):
    sib_sib = i.next_sibling.next_sibling.next_sibling.next_sibling
    releases += [i.a.text]
    links += [i.a['href']]
    dates += [i.next_sibling.next_sibling.text.strip()]
    descs += [re.sub(r'\t+|\s+', ' ', sib_sib.text.strip())]
    addit_urls += ['N/A' if sib_sib.a is None else sib_sib.a['href']]

result = list(zip(releases, links, dates, descs, addit_urls))
print(result)
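If you want to persist the zipped rows, here is a minimal sketch using the standard csv module (the filename and header labels are arbitrary):

import csv

with open('litarchive2010.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['release', 'link', 'date', 'description', 'additional_url'])
    writer.writerows(result)  # each zipped tuple becomes one CSV row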