Can't scrape HTML table using BeautifulSoup

Time: 2016-02-03 04:14:20

Tags: python selenium beautifulsoup

I'm trying to scrape data off a table on a web page using Python, BeautifulSoup, Requests, as well as Selenium to log into the site. Here's the table I'm looking to get data for...

<div class="sastrupp-class">
        <table>
            <tbody>
                <tr>
                    <td class="key">Thing I dont want 1</td>
                    <td class="value money">$1.23</td>

                    <td class="key">Thing I dont want 2</td>
                    <td class="value">99,999,999</td>

                    <td class="key">Target</td>
                    <td class="money value">$1.23</td>

                    <td class="key">Thing I dont want 3</td>
                    <td class="money value">$1.23</td>

                    <td class="key">Thing I dont want 4</td>
                    <td class="value percentage">1.23%</td>

                    <td class="key">Thing I dont want 5</td>
                    <td class="money value">$1.23</td>
                </tr>
            </tbody>
        </table>
    </div>
I can find the "sastrupp-class" fine, but I don't know how to look through it and get to the part of the table I want. I figured I could just look for the class that I'm searching for like this...

    output = soup.find('td', {'class':'key'})
    print(output)

but that doesn't return anything.

Important to note:

  1. Other <td>s inside the table have the same class name as the one that I want. If I can't separate them out, I'm OK with that, although I'd rather just return the one I want.

  2. There are other <div>s with class="sastrupp-class" on the site.

  3. I'm obviously a beginner at this, so let me know if I can help you help me. Any help/pointers would be appreciated.

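1 Answer:

Answer 0 (score: -1)

1) First, to get the "Target" you need find_all rather than find, because find returns only the first matching element. Then, given that you know exactly which position your target will be at (index = 2 in the example you gave), you can get it like this: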

from bs4 import BeautifulSoup

html = """(YOUR HTML)"""

soup = BeautifulSoup(html, 'html.parser')
table = soup.find('div', {'class': 'sastrupp-class'})
all_keys = table.find_all('td', {'class': 'key'})
my_key = all_keys[2]

print(my_key.text)  # prints 'Target' when html holds the table from the question
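
If the position of the cell might change, one alternative is to match the key cell by its text and then read the cell right after it. This is only a sketch: it assumes the label is exactly 'Target' and that the value sits in the next <td> sibling, as in the HTML from the question. The HTML below is a trimmed copy of that table, and the names key_cell and value_cell are just illustrative.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the table from the question (structure assumed unchanged).
html = """
<div class="sastrupp-class">
  <table><tbody><tr>
    <td class="key">Thing I dont want 2</td>
    <td class="value">99,999,999</td>
    <td class="key">Target</td>
    <td class="money value">$1.23</td>
    <td class="key">Thing I dont want 4</td>
    <td class="value percentage">1.23%</td>
  </tr></tbody></table>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

# Find the key cell whose text is exactly 'Target' ...
key_cell = soup.find('td', class_='key', string='Target')
# ... then step to the next <td> sibling, which holds its value.
value_cell = key_cell.find_next_sibling('td')

print(value_cell.text)
```

This avoids hard-coding the index, but it breaks if the label text changes, so which approach is safer depends on the page.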

2) You also noted that there are other <div>s with class="sastrupp-class" on the site.

Likewise, use find_all to collect all of them, then pick the one you need by its index.

Example HTML:

<body>
<div class="sastrupp-class"> Don't need this</div>
<div class="sastrupp-class"> Don't need this</div>
<div class="sastrupp-class"> Don't need this</div>
<div class="sastrupp-class"> Target</div>
</body>

To extract the target, you can:

all_divs = soup.find_all('div', {'class':'sastrupp-class'})
target = all_divs[3]  # assuming you know exactly which index to look for
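
If you don't know the index in advance, a possible variation is to filter the matching <div>s by their content instead. The sketch below reuses the example HTML above and assumes the wanted <div> can be recognised by its text being 'Target'. On the real page you would need some other distinguishing feature, which I can't tell from the question.

```python
from bs4 import BeautifulSoup

# Same example HTML as above.
html = """
<body>
<div class="sastrupp-class"> Don't need this</div>
<div class="sastrupp-class"> Don't need this</div>
<div class="sastrupp-class"> Don't need this</div>
<div class="sastrupp-class"> Target</div>
</body>
"""

soup = BeautifulSoup(html, 'html.parser')
all_divs = soup.find_all('div', class_='sastrupp-class')

# Keep only the divs whose whitespace-stripped text equals 'Target'
# (assumed criterion, for illustration only).
matches = [d for d in all_divs if d.get_text(strip=True) == 'Target']

# Fall back to None if nothing matched, instead of raising IndexError.
target = matches[0] if matches else None

print(target)
```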