I know this kind of question comes up often, but I have been searching and have not found a similar one.
<div class="casts">
<table cellpadding="0" cellspacing="0">
<tbody>
<tr>
<td class="">
<a class="cast">
<span class="title">
Nested data 1
<span class="schedule">
Nested data 2
</span>
</span>
</a>
</td>
</tr>
</tbody>
</table>
</div>
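There are multiple td elements with the same structure, but I removed the rest for simplicity. Say I want to extract Nested data 1 and Nested data 2 from the spans. I use the following: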
finda = soup.find_all('a', attrs={'class':'cast'})
for var in finda:
    var2 = var.find_all('span')
Using:
var2[1]
I can pull out all of the Nested data 2 values, but I cannot extract only Nested data 1:
var2[0]
returns Nested data2 Nested data1
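Answer 0 (score: 1)
This can be done in a fairly simple way by iterating over the children of each span and keeping only the text nodes that sit directly inside it:
stack.html: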
<!DOCTYPE html>
<html lang="en">
<head>
<title>StackO</title>
<meta charset="utf-8">
</head>
<body>
<div class="casts">
<table cellpadding="0" cellspacing="0">
<tbody>
<tr>
<td class="">
<a class="cast">
<span class="title">
Nested data 1
<span class="schedule">
Nested data 2
<span class="moar-nesting">
Nested data 3
</span>
</span>
Nested data 4
</span>
</a>
</td>
</tr>
</tbody>
</table>
</div>
</body>
</html>
Meanwhile, in IPython land...
In [1]: from bs4 import BeautifulSoup, NavigableString, Comment
In [2]: with open('stack.html', 'r') as f:
   ...:     markup = f.read()
   ...:
In [3]: soup = BeautifulSoup(markup, 'html.parser')
In [4]: casts = soup.find_all('a', attrs={'class': 'cast'})
In [5]: cast = casts[0]
In [6]: for span in cast.find_all('span'):
   ...:     for child in span.children:
   ...:         # Keep only non-empty text nodes that are direct children of this span
   ...:         if isinstance(child, NavigableString) and not isinstance(child, Comment) and str(child).strip() != "":
   ...:             print('"{}"'.format(str(child).strip()))
   ...:
"Nested data 1"
"Nested data 4"
"Nested data 2"
"Nested data 3"
In [10]:
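The same idea can be wrapped in a small helper so that each span yields only its own text. This is a minimal sketch, not part of the original answer: the name `own_text` is hypothetical, the markup is a trimmed inline copy of the snippet from the question, and `html.parser` is assumed as the parser.

```python
from bs4 import BeautifulSoup, Comment, NavigableString

# Trimmed copy of the question's markup, inlined here for illustration.
MARKUP = """
<a class="cast">
  <span class="title">
    Nested data 1
    <span class="schedule">
      Nested data 2
    </span>
  </span>
</a>
"""


def own_text(tag):
    """Hypothetical helper: text held directly by `tag`, ignoring nested tags."""
    pieces = []
    for child in tag.children:
        # Nested tags are skipped; only bare strings (not comments) are kept.
        if isinstance(child, NavigableString) and not isinstance(child, Comment):
            stripped = child.strip()
            if stripped:
                pieces.append(stripped)
    return " ".join(pieces)


soup = BeautifulSoup(MARKUP, "html.parser")
title_text = own_text(soup.find("span", class_="title"))
schedule_text = own_text(soup.find("span", class_="schedule"))
print(title_text)
print(schedule_text)
```

With this, the outer span gives only its own string and the inner span gives its own, so `var2[0]` in the question could be passed through the helper instead of being read as a whole.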