Accessing nested elements with BeautifulSoup

Time: 2017-10-15 20:16:25

Tags: python html beautifulsoup html-parsing

I have the following HTML:

<div id="contentDiv">
    <!-- START FILER DIV -->
    <div style="margin: 15px 0 10px 0; padding: 3px; overflow: hidden; background-color: #BCD6F8;">
    <div class="mailer">Mailing Address
        <span class="mailerAddress">500 ORACLE PARKWAY</span>
        <span class="mailerAddress">MAIL STOP 5 OP 7</span>
        <span class="mailerAddress">REDWOOD CITY CA 94065</span>
     </div>

I am trying to access "500 ORACLE PARKWAY" and "MAIL STOP 5 OP 7", but I can't find a way to do it. This is my attempt:

for item in soup.findAll("span", {"class" : "mailerAddress"}):
    if item.parent.name == 'div':
        return_list.append(item.contents)

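Edit: I forgot to mention that further down in the HTML there are other elements that use similar tags, so this captures all of them when I only want the first two.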

Edit: link: https://www.sec.gov/cgi-bin/browse-edgar?CIK=orcl

2 answers:

Answer 0: (score: 0)

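I'll try to answer this with the information we have. If you only want the first two elements with a given class on the page, you can slice the list of matches.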
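
A minimal sketch of the slicing idea, assuming the markup from the question is already available as a string (the `html` variable here is a trimmed stand-in for the real page, and `html.parser` is used only so the example needs no extra parser):

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question (stand-in for the real page).
html = """
<div id="contentDiv">
  <div class="mailer">Mailing Address
    <span class="mailerAddress">500 ORACLE PARKWAY</span>
    <span class="mailerAddress">MAIL STOP 5 OP 7</span>
    <span class="mailerAddress">REDWOOD CITY CA 94065</span>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all returns a list-like ResultSet, so ordinary slicing keeps the first two matches.
matches = soup.find_all("span", class_="mailerAddress")[:2]
first_two = [span.get_text(strip=True) for span in matches]
print(first_two)
```

Slicing depends on document order, so it only helps when the two wanted spans are the first matches on the page.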

Answer 1: (score: 0)

Try this:

from bs4 import BeautifulSoup
import requests

res = requests.get("https://www.sec.gov/cgi-bin/browse-edgar?CIK=orcl").text
soup = BeautifulSoup(res,'lxml')
for item in soup.find_all(class_="mailerAddress")[:2]:
    print(item.text)

Result:

500 ORACLE PARKWAY
MAIL STOP 5 OP 7
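
If other blocks further down the page reuse the same class, a possibly more robust variant is to locate the first `div.mailer` and search only inside it. This is a sketch under assumptions: the second "Business Address" block and its contents are made up for illustration and are not taken from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical page: the second mailer block is invented to mimic the
# "similar tags later in the HTML" situation described in the question.
html = """
<div id="contentDiv">
  <div class="mailer">Mailing Address
    <span class="mailerAddress">500 ORACLE PARKWAY</span>
    <span class="mailerAddress">MAIL STOP 5 OP 7</span>
    <span class="mailerAddress">REDWOOD CITY CA 94065</span>
  </div>
  <div class="mailer">Business Address
    <span class="mailerAddress">EXAMPLE STREET 1</span>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns only the first matching element, i.e. the first mailer block.
mailer = soup.find("div", class_="mailer")

# Searching from that element restricts the matches to its descendants.
lines = [s.get_text(strip=True) for s in mailer.find_all("span", class_="mailerAddress")]
street_lines = lines[:2]
print(street_lines)
```

Scoping the search to the parent element keeps spans from later blocks out of the result, so the slice only has to drop the city line.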