Scraping nested tags

Date: 2015-05-06 18:21:58

Tags: python beautifulsoup html-parsing

I know this kind of question comes up a lot, but I have been searching and haven't found a similar one.

<div class="casts">
    <table cellpadding="0" cellspacing="0">
        <tbody>
            <tr>
                <td class="">
                    <a class="cast">
                        <span class="title">
                            Nested data 1 
                            <span class="schedule">
                                Nested data 2
                            </span>
                        </span>
                    </a>
                </td>
            </tr>
        </tbody>
    </table>
</div>

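There are multiple td elements with the same structure, but I removed the rest for simplicity. Suppose I want to extract Nested data 1 and Nested data 2 from the spans separately. I use the following: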

finda = soup.find_all('a', attrs={'class':'cast'})

for var in finda:
  var2 = var.find_all('span')

Using:

var2[1]

I can pull out every Nested data 2.

But I can't extract Nested data 1 on its own:

var2[0]

returns both Nested data 1 and Nested data 2, because the outer span contains the inner one.

1 answer:

Answer 0 (score: 1)

This can be done fairly simply by iterating over the children of each span:

stack.html

<!DOCTYPE html>
<html lang="en">
<head>
  <title>StackO</title>
  <meta charset="utf-8">
</head>
<body>
  <div class="casts">
    <table cellpadding="0" cellspacing="0">
      <tbody>
        <tr>
          <td class="">
            <a class="cast">
              <span class="title">
                Nested data 1 
                <span class="schedule">
                  Nested data 2
                  <span class="moar-nesting">
                    Nested data 3
                  </span>
                </span>
                Nested data 4
              </span>
            </a>
          </td>
        </tr>
      </tbody>
    </table>
  </div>
</body>
</html>

Meanwhile, in IPython land...

In [1]: from bs4 import BeautifulSoup, NavigableString, Comment

In [2]: with open('stack.html', 'r') as f:
   ...:     markup = f.read()
   ...:

In [3]: soup = BeautifulSoup(markup, 'html.parser')

In [4]: casts = soup.find_all('a', attrs={'class': 'cast'})

In [5]: cast = casts[0]

In [6]: for span in cast.find_all('span'):
   ...:     for child in span.children:
   ...:         if isinstance(child, NavigableString) and not isinstance(child, Comment) and str(child).strip() != "":
   ...:             print('"{}"'.format(str(child).strip()))
   ...:
"Nested data 1"
"Nested data 4"
"Nested data 2"
"Nested data 3"

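Applied to the markup from the question, the same idea can be wrapped in a small helper. This is only a sketch: `direct_text` is a hypothetical name (not part of Beautiful Soup), the markup is trimmed to a single `a.cast` element, and Python 3 with the built-in `html.parser` backend is assumed. It collects only the strings that are direct children of a tag, so the outer span yields Nested data 1 without the inner span's text.

```python
from bs4 import BeautifulSoup, Comment, NavigableString

# Markup trimmed from the question: one a.cast with two nested spans.
MARKUP = """
<a class="cast">
  <span class="title">
    Nested data 1
    <span class="schedule">
      Nested data 2
    </span>
  </span>
</a>
"""


def direct_text(tag):
    """Return the tag's own text, ignoring text inside child tags.

    Hypothetical helper for illustration; not a Beautiful Soup API.
    """
    parts = [
        str(child).strip()
        for child in tag.children
        # Keep plain strings only; Comment is a NavigableString subclass.
        if isinstance(child, NavigableString) and not isinstance(child, Comment)
    ]
    # Drop whitespace-only fragments and join the remainder.
    return " ".join(part for part in parts if part)


soup = BeautifulSoup(MARKUP, "html.parser")
for cast in soup.find_all("a", class_="cast"):
    title = cast.find("span", class_="title")
    schedule = cast.find("span", class_="schedule")
    # prints: Nested data 1 | Nested data 2
    print(direct_text(title), "|", direct_text(schedule))
```

Looking the spans up by class rather than by index (`var2[0]`, `var2[1]`) keeps the code from depending on how many spans each cell happens to contain.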