How to extract specific content from a table

Time: 2018-09-03 01:58:29

Tags: python beautifulsoup

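I am building a web scraper with Python and BeautifulSoup, and I have run into a problem while scraping a table.

My problem is that I only need to extract selected <td> tags, not the whole table.

I only need the figures for the 52-week high, the 52-week low, earnings per share, and the price-to-book ratio.

Is there any way I can do this?

Sample table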

<table id="TABLE_1">
  <tbody id="TBODY_2">
    <tr id="TR_3">
      <td id="TD_4">
        <strong id="STRONG_5">52-Week High:</strong>
      </td>
      <td id="TD_6">
        1,116.00
      </td>
      <td id="TD_7">
        <strong id="STRONG_8">Earnings Per Share TTM (EPS):</strong>
      </td>
      <td id="TD_9">
        47.87 (15.57%)
      </td>
      <td id="TD_10">
        <strong id="STRONG_11">Price to Book Value (P/BV):</strong>
      </td>
      <td id="TD_12">
        2.5481125565
      </td>
    </tr>
    <tr id="TR_13">
      <td id="TD_14">
        <strong id="STRONG_15">52-Week Low:</strong>
      </td>
      <td id="TD_16">
        867.50
      </td>
      <td id="TD_17">
        <strong id="STRONG_18">Price-Earnings Ratio TTM (P/E):</strong>
      </td>
      <td id="TD_19">
        20.8272404429
      </td>
      <td id="TD_20">
        <strong id="STRONG_21">Return on Equity (ROE):</strong>
      </td>
      <td id="TD_22">
        12.42%
      </td>
    </tr>
    <tr id="TR_23">
      <td id="TD_24">
        <strong id="STRONG_25">Fair Value:</strong>
      </td>
      <td id="TD_26">
        -
      </td>
      <td id="TD_27">
        <strong id="STRONG_28">Dividends Per Share (DPS):</strong>
      </td>
      <td id="TD_29">
        -
      </td>
      <td id="TD_30">
        <strong id="STRONG_31">Recommendation:</strong>
      </td>
      <td id="TD_32">
        None<span id="SPAN_33"></span>
      </td>
    </tr>
    <tr id="TR_34">
      <td id="TD_35">
        <strong id="STRONG_36">Last Price:</strong>
      </td>
      <td id="TD_37">
        <span id="SPAN_38"></span> <span id="SPAN_39">984.5</span>
      </td>
    </tr>
  </tbody>
</table>

I have also included my code for reference.

Any help would be much appreciated! Thank you!

from bs4 import BeautifulSoup as soup
from urllib.request import Request, urlopen
import pandas as pd

myurl = "https://www.investagrams.com/Stock/ac"
hdr = {'User-Agent': 'Mozilla/5.0'}
req = Request(myurl,headers=hdr)

# Open connection to website
uClient = urlopen(req)

# Offloads the content to variable
page_html = uClient.read()

#just closing it
uClient.close()

# html parser
page_soup = soup(page_html, "html.parser")

table = page_soup.find("div", {"id":"FundamentalAnalysisPanel"}).find("table")

print(table.text)

2 Answers:

Answer 0: (score: 1)

You can do this with the find_next_sibling method (spelled findNextSibling in older BeautifulSoup code).

import requests
from bs4 import BeautifulSoup

# Send a browser-like User-Agent, as the question's code does
r = requests.get('https://www.investagrams.com/Stock/ac',
                 headers={'User-Agent': 'Mozilla/5.0'})
# Name the parser explicitly to avoid bs4's "no parser specified" warning
soup = BeautifulSoup(r.text, 'html.parser')

# labels of the table parameters whose values you want
parameters = ['52-Week High:', '52-Week Low:', 'Earnings Per Share TTM (EPS):', 'Price-Earnings Ratio TTM (P/E):', 'Price to Book Value (P/BV):']

# Iterate over all <td> tags and print the text of the next sibling <td>
# (the value) if this <td> holds one of the specified labels.
for td in soup.find_all('td'):
    strong = td.find('strong')
    if strong is not None and strong.get_text(strip=True) in parameters:
        print(td.find_next_sibling('td').get_text(strip=True))

Result:

1,116.00
47.87 (15.57%)
2.5481125565
867.50
20.8272404429
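
The output above lists the values without their labels, in page order. As a rough sketch of the same sibling-lookup idea, the snippet below returns a dictionary keyed by label instead. It runs against a trimmed inline copy of the sample table rather than the live page, and the names `SAMPLE_HTML` and `extract_values` are made up for this illustration.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sample table from the question (ids omitted for brevity)
SAMPLE_HTML = """
<table>
  <tr>
    <td><strong>52-Week High:</strong></td><td> 1,116.00 </td>
    <td><strong>Earnings Per Share TTM (EPS):</strong></td><td> 47.87 (15.57%) </td>
    <td><strong>Price to Book Value (P/BV):</strong></td><td> 2.5481125565 </td>
  </tr>
  <tr>
    <td><strong>52-Week Low:</strong></td><td> 867.50 </td>
    <td><strong>Price-Earnings Ratio TTM (P/E):</strong></td><td> 20.8272404429 </td>
    <td><strong>Return on Equity (ROE):</strong></td><td> 12.42% </td>
  </tr>
</table>
"""


def extract_values(html, labels):
    """Map each requested label to the text of the <td> that follows it.

    Hypothetical helper: labels that are not found are simply left out.
    """
    soup = BeautifulSoup(html, "html.parser")
    values = {}
    for label in labels:
        # Find the <strong> whose only text is exactly this label
        strong = soup.find("strong", string=label)
        if strong is None:
            continue
        # The value lives in the <td> right after the label's <td>
        value_cell = strong.find_parent("td").find_next_sibling("td")
        if value_cell is not None:
            values[label] = value_cell.get_text(strip=True)
    return values


wanted = ["52-Week High:", "52-Week Low:", "Price to Book Value (P/BV):"]
for label, value in extract_values(SAMPLE_HTML, wanted).items():
    print(label, value)
```

Looking up each label directly avoids scanning every <td> for every label, and keeping the label next to its value makes it harder to mix up the figures if the page order changes.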

Answer 1: (score: 0)

This may be what you are looking for:

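    from io import StringIO

    # page_html is the raw response read in the question's code
    page_soup = soup(page_html.decode('utf-8'), 'html.parser')

    # read_html needs a whole <table> element, not individual <td> cells
    tables = page_soup.find_all('table')

    i = 0  # placeholder: index of the table you want
    # read_html returns a list of DataFrames; take the first one
    df = pd.read_html(StringIO(str(tables[i])))[0]

where i is the index of the table you want.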
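
If you would rather read the whole table first and pick fields afterwards, one possible sketch with BeautifulSoup alone (swapped in here for pandas, so this is not the approach this answer describes) is to pair up the cells in each row. It assumes the cells always alternate label, value, as they do in the sample table; `SAMPLE_HTML` and `table_to_dict` are names invented for the example.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sample table from the question (ids omitted for brevity)
SAMPLE_HTML = """
<table>
  <tr>
    <td><strong>52-Week High:</strong></td><td> 1,116.00 </td>
    <td><strong>Price to Book Value (P/BV):</strong></td><td> 2.5481125565 </td>
  </tr>
  <tr>
    <td><strong>Fair Value:</strong></td><td> - </td>
    <td><strong>Recommendation:</strong></td><td> None<span></span> </td>
  </tr>
  <tr>
    <td><strong>Last Price:</strong></td>
    <td><span></span> <span>984.5</span></td>
  </tr>
</table>
"""


def table_to_dict(html):
    """Turn the first <table> into {label: value}.

    Hypothetical helper; assumes cells alternate label, value in each row.
    """
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    pairs = {}
    if table is None:
        return pairs
    for row in table.find_all("tr"):
        # Collapse each cell to its visible text, dropping empty fragments
        cells = [td.get_text(" ", strip=True) for td in row.find_all("td")]
        # Even positions hold labels, odd positions hold the matching values
        for label, value in zip(cells[0::2], cells[1::2]):
            pairs[label.rstrip(":")] = value
    return pairs


data = table_to_dict(SAMPLE_HTML)
for key in ("52-Week High", "Price to Book Value (P/BV)", "Last Price"):
    print(key, data.get(key))
```

This keeps every field available, so adding another figure later only means asking the dictionary for one more key.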