Beautiful Soup: how to select <a href> and <td> elements containing whitespace

Date: 2019-10-05 05:40:13

Tags: python web-scraping beautifulsoup

I am trying to use BeautifulSoup to select the date, URL, description, and a second URL from a table, and I cannot access them because of odd whitespace:

So far I have written:

import urllib.request
from bs4 import BeautifulSoup

def make_soup(url):
    # fetch the page and parse it with the built-in HTML parser
    thepage = urllib.request.urlopen(url)
    soupdata = BeautifulSoup(thepage, "html.parser")
    return soupdata

soup = make_soup('https://www.sec.gov/litigation/litreleases/litrelarchive/litarchive2010.shtml')

# grab every <td nowrap="nowrap"> cell and strip the surrounding whitespace
test1 = soup.find_all("td", {"nowrap": "nowrap"})
test2 = [item.text.strip() for item in test1]


2 Answers:

Answer 0 (score: 0)

Unfortunately there is no class or id HTML attribute that quickly identifies the table to scrape. After some experimentation I found that it is the table at index 4.
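A quick way to confirm which index holds the data is to print a snippet of each table's first row (a throwaway diagnostic sketch, assuming the soup object from the question):

for idx, tbl in enumerate(soup.find_all('table')):
    first_row = tbl.find('tr')
    if first_row is not None:
        # show the table index and the start of its first row's text
        print(idx, first_row.get_text(strip=True)[:60])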

Next, we separate the header from the data and ignore it; the data still contains rows that only delimit the quarters. Since those rows hold just a single table-data tag, we can skip them with a try-except block.

I noticed that the descriptions are separated by tabs, so I split the text on \t.

For the URLs I used .get('href') instead of ['href'] because, in my experience, not every anchor tag has an href attribute, and this avoids an error in that case. Finally, the second anchor tag does not always appear, so it too is wrapped in a try-except block.
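To illustrate the difference (a generic example, not taken from the page being scraped): bracket indexing raises a KeyError when the attribute is missing, whereas .get returns None:

from bs4 import BeautifulSoup

tag = BeautifulSoup('<a>anchor without href</a>', 'html.parser').a
print(tag.get('href'))  # prints None, no exception
tag['href']             # raises KeyError: 'href'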

data = []
table = soup.find_all('table')[4] # target the specific table
header, *rows = table.find_all('tr')

for row in rows:
    try:
        litigation, date, complaint = row.find_all('td')
    except ValueError:
        continue # ignore quarter rows

    lit_id = litigation.text.strip().split('-')[-1]  # avoid shadowing the built-in id
    date = date.text.strip()
    desc = complaint.text.strip().split('\t')[0]
    lit_url = litigation.find('a').get('href')

    try:
        comp_url = complaint.find('a').get('href')
    except AttributeError:
        comp_url = None # complaint url is optional

    info = dict(id=lit_id, date=date, desc=desc, lit_url=lit_url, comp_url=comp_url)
    data.append(info)
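To spot-check or save the output, something like this works (a minimal sketch, assuming the loop above populated data; the filename is hypothetical):

import csv

print(data[0])  # inspect the first record

with open('litreleases_2010.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['id', 'date', 'desc', 'lit_url', 'comp_url'])
    writer.writeheader()
    writer.writerows(data)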

Answer 1 (score: 0)

With bs4 4.7.1 you can combine the :has and :nth-of-type CSS selectors with next_sibling to pull those columns.

from bs4 import BeautifulSoup 
import requests, re

def make_soup(url):
    the_page = requests.get(url)
    soup_data = BeautifulSoup(the_page.content, "html.parser")
    return soup_data

soup = make_soup('https://www.sec.gov/litigation/litreleases/litrelarchive/litarchive2010.shtml')
releases = []
links = []
dates = []
descs = [] 
addit_urls = []

for i in soup.select('td:nth-of-type(1):has([href^="/litigation/litreleases/"])'):
    # each pair of next_sibling hops skips the whitespace text node between <td> tags
    sib_sib = i.next_sibling.next_sibling.next_sibling.next_sibling
    releases += [i.a.text]
    links += [i.a['href']]
    dates += [i.next_sibling.next_sibling.text.strip()]
    descs += [re.sub(r'\s+', ' ', sib_sib.text.strip())]
    addit_urls += ['N/A' if sib_sib.a is None else sib_sib.a['href']]

result = list(zip(releases, links, dates, descs, addit_urls))
print(result)
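If a tabular view is easier to read than raw tuples, the result can also be loaded into pandas (an optional extra, assuming pandas is installed):

import pandas as pd

df = pd.DataFrame(result, columns=['release', 'link', 'date', 'desc', 'addit_url'])
print(df.head())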