Scraping elements without unique identifiers using BeautifulSoup

Asked: 2018-10-22 17:55:11

Tags: python python-3.x beautifulsoup python-requests

I've done a small amount of web scraping in Python before, but I'm stuck on what is probably a fairly simple problem.

I want to pull the rates from the table on this page.

I can grab things like a single element, or all of the rates (since they're all listed under the "fccu__slash" class), but I don't know how to get the results row by row in a usable format.

Here's the relevant part of my code:

import requests
from bs4 import BeautifulSoup

FCCU_url = "https://www.fccu.org/Rates/CD-Rates"
FCCU_resp = requests.get(FCCU_url, timeout=3)
FCCU_soup = BeautifulSoup(FCCU_resp.content, "html.parser")
for elem in FCCU_soup.find_all("td"):
    try:
        print(elem.contents[0])
    except IndexError:
        print(elem.contents)

This outputs all of the information I want, but not in a usable format.

Ideally, I'd only like to scrape the CDs whose terms I'm interested in, and output the results row by row (I only care about the rate, not the APY).

These aren't the exact fields I care about, but once I understand how it's done I'd like to adapt it myself.

Thanks for any help.
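To make the "not a usable format" problem concrete: iterating bare td cells flattens the table, so the association between a term and its rate is lost. A minimal illustration on a stand-in snippet (the markup below is an assumption that mimics a rates table, not the live page):

```python
from bs4 import BeautifulSoup

# Hypothetical two-row table (assumption, not the live page's markup)
html = ("<table>"
        "<tr><td>3 Month</td><td>0.65%</td></tr>"
        "<tr><td>1 Year</td><td>2.13%</td></tr>"
        "</table>")
soup = BeautifulSoup(html, "html.parser")

# find_all("td") walks every cell in document order, discarding row boundaries
flat = [td.text for td in soup.find_all("td")]
print(flat)  # ['3 Month', '0.65%', '1 Year', '2.13%'] -- row structure is gone
```

Iterating tr elements instead, as the answers below do, keeps each row's cells together.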

3 Answers:

Answer 0 (score: 2):

Try the following code to get the output you want:

import requests
from bs4 import BeautifulSoup

FCCU_url = "https://www.fccu.org/Rates/CD-Rates"
FCCU_resp = requests.get(FCCU_url, timeout=3)
FCCU_soup = BeautifulSoup(FCCU_resp.content, "html.parser")
# Iterate over table rows instead of individual cells so each row stays together
for elem in FCCU_soup.select("tbody tr"):
    cells = elem.find_all("td")
    # Column 0 is the term; columns 2 and 3 hold the rate and APY inside <span>s
    data = [cells[0].text, cells[2].span.text, cells[3].span.text]
    print(data)

Output:

['3 Month', '0.65%', '0.75%']
['6 Month', '1.44%', '1.59%']
['1 Year', '2.13%', '2.37%']
['2 Year', '2.37%', '2.62%']
['3 Year', '2.27%', '2.52%']
['4 Year', '2.37%', '2.62%']
['5 Year', '2.96%', '3.20%']
['9 Month', '0.95%', '1.09%']
['19 Month', '1.98%', '2.08%']
['2 Year²', '2.27%', '2.52%']
['4 Year³', '2.32%', '2.57%']
['2 Year', '2.27%', 'N/A']
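Since the asker only wants certain terms and the rate column (no APY), the same row-wise approach extends to a filtering step. A minimal sketch, using an inline HTML snippet that mimics the table's structure (the snippet and the `wanted_terms` values are assumptions, not the live page):

```python
from bs4 import BeautifulSoup

# Hypothetical HTML shaped like the rates table: term, minimum, rate, APY
html = """
<table><tbody>
  <tr><td>3 Month</td><td>$500</td><td><span>0.65%</span></td><td><span>0.75%</span></td></tr>
  <tr><td>1 Year</td><td>$500</td><td><span>2.13%</span></td><td><span>2.37%</span></td></tr>
  <tr><td>5 Year</td><td>$500</td><td><span>2.96%</span></td><td><span>3.20%</span></td></tr>
</tbody></table>
"""

wanted_terms = {"3 Month", "5 Year"}  # terms of interest (illustrative)
soup = BeautifulSoup(html, "html.parser")
rates = {}
for row in soup.select("tbody tr"):
    cells = row.find_all("td")
    term = cells[0].text
    if term in wanted_terms:
        rates[term] = cells[2].span.text  # keep the rate column only, skip APY

print(rates)  # {'3 Month': '0.65%', '5 Year': '2.96%'}
```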

Answer 1 (score: 0):

Scrape the entire HTML table, then work with just the columns you need.

pandas' read_html does a good job of this.

First find the table element:

tableobject = FCCU_soup.find_all("table")

Pass it to pandas:

data = pd.read_html(str(tableobject))

Then drop the columns you don't want.
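A self-contained sketch of this approach, run against a hypothetical HTML fragment shaped like the rates table (the markup and column names are assumptions, not the live page; note that read_html needs lxml or html5lib installed):

```python
from io import StringIO
import pandas as pd

# Hypothetical markup mimicking the rates table (assumption)
html = """
<table>
  <thead><tr><th>Term</th><th>Minimum</th><th>Rate</th><th>APY</th></tr></thead>
  <tbody>
    <tr><td>3 Month</td><td>$500</td><td>0.65%</td><td>0.75%</td></tr>
    <tr><td>1 Year</td><td>$500</td><td>2.13%</td><td>2.37%</td></tr>
  </tbody>
</table>
"""

# read_html returns a list of DataFrames, one per <table> found
df = pd.read_html(StringIO(html))[0]

# Drop the columns we don't need, keeping term and rate only
df = df.drop(columns=["Minimum", "APY"])
print(df)
```

Passing the markup through StringIO keeps this compatible with recent pandas versions, which deprecate passing raw HTML strings directly.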

Answer 2 (score: 0):

I tried your code and used itertools to group the results into sets of 6 elements.

import requests
from bs4 import BeautifulSoup
from itertools import zip_longest

FCCU_url = "https://www.fccu.org/Rates/CD-Rates"
FCCU_resp = requests.get(FCCU_url, timeout=3)
FCCU_soup = BeautifulSoup(FCCU_resp.content, "lxml")

result = []
for e in FCCU_soup.find_all("td"):
    spans = e.find_all("span")
    if spans:
        # Rate/APY cells wrap their values in <span>s
        result.extend(sp.text for sp in spans)
    else:
        result.append(e.text)

def grouper(iterable, n, fillvalue=None):
    # Collect the flat cell list into fixed-length rows of n items
    args = [iter(iterable)] * n
    return list(zip_longest(*args, fillvalue=fillvalue))

print(grouper(result, 6))

Output:


[(b'3 Month', b'$500', b'0.65%', b'0.65%', b'0.75%', b'0.75%'), (b'6 Month', b'$500', b'1.44%', b'1.45%', b'1.59%', b'1.60%'), (b'1 Year', b'$500', b'2.13%', b'2.15%', b'2.37%', b'2.40%'), (b'2 Year', b'$500', b'2.37%', b'2.40%', b'2.62%', b'2.65%'), (b'3 Year', b'$500', b'2.27%', b'2.30%', b'2.52%', b'2.55%'), (b'4 Year', b'$500', b'2.37%', b'2.40%', b'2.62%', b'2.65%'), ...
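The grouper recipe above is independent of the scraping step; a quick self-contained demonstration with plain data (the sample list is illustrative only) shows how a short final chunk gets padded with the fill value:

```python
from itertools import zip_longest

def grouper(iterable, n, fillvalue=None):
    # Collect data into fixed-length chunks; the last chunk is padded with fillvalue
    args = [iter(iterable)] * n
    return list(zip_longest(*args, fillvalue=fillvalue))

rows = grouper([1, 2, 3, 4, 5], 2)
print(rows)  # [(1, 2), (3, 4), (5, None)]
```

Replicating the same iterator n times in `args` makes zip_longest consume consecutive items for each tuple, which is what turns the flat list into rows.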