Nested scraping with BeautifulSoup

Date: 2015-02-01 14:54:16

Tags: python html beautifulsoup

My HTML is as follows:

<h1><a name="hello">Hello</a></h1>
<div class="colmask">
<div class="box box_1">
<h4><a>My Favorite Number is</a></h4>
<ul><li><a>1</a></li></ul>
<ul><li><a>2</a></li></ul>
<ul><li><a>3</a></li></ul>
<ul><li><a>4</a></li></ul>
</div>
<div class="box box_2">
<h4><a>Your Favorite Number is</a></h4>
<ul><li><a>1</a></li></ul>
<ul><li><a>2</a></li></ul>
<ul><li><a>3</a></li></ul>
<ul><li><a>4</a></li></ul>
</div>
</div>
<h1 name="goodbye"><a>Goodbye</a></h1>
<div class="colmask">
<div class="box box_1">
<h4><a>Their Favorite Number is</a></h4>
<ul><li><a>1</a></li></ul>
<ul><li><a>2</a></li></ul>
<ul><li><a>3</a></li></ul>
<ul><li><a>4</a></li></ul>
</div>
<div class="box box_2">
<h4><a>Our Favorite Number is</a></h4>
<ul><li><a>1</a></li></ul>
<ul><li><a>2</a></li></ul>
<ul><li><a>3</a></li></ul>
<ul><li><a>4</a></li></ul>
</div>
</div>

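My loops are wrong, and I don't really understand how to iterate here, because I end up with all of the values lumped together. Can someone point me in the right direction? I tried findNext(), nextSibling, and findAll(), but without success.

The output I want is: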

Hello : My Favorite Number is : 1
Hello : My Favorite Number is : 2
Hello : My Favorite Number is : 3
Hello : My Favorite Number is : 4
Hello : Your Favorite Number is : 1
Hello : Your Favorite Number is : 2
Hello : Your Favorite Number is : 3
Hello : Your Favorite Number is : 4
Goodbye: Their Favorite Number is: 1
Goodbye: Their Favorite Number is: 2
Goodbye: Their Favorite Number is: 3
Goodbye: Their Favorite Number is: 4
Goodbye: Our Favorite Number is: 1
Goodbye: Our Favorite Number is: 2
Goodbye: Our Favorite Number is: 3
Goodbye: Our Favorite Number is: 4

1 Answer:

Answer 0 (score: 0)

If you are having trouble with nextSibling, it is because your HTML actually looks like this:

<h1><a name="hello">Hello</a></h1>\n #<---newline
<div class="colmask">

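See the newline after </h1>? Even though a newline is invisible, it still counts as text, so it becomes a BeautifulSoup element (a NavigableString), and it is that string, not the <div>, that is the nextSibling of the <h1> tag.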
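
A minimal sketch to check this yourself. It assumes the newer bs4 package (where the attribute is spelled next_sibling) and the standard-library 'html.parser' backend, rather than the BeautifulSoup 3 import used later in this answer; the variable names are only for illustration.

```python
from bs4 import BeautifulSoup, NavigableString

# Two lines of the question's HTML, with the newline between them written out.
html = '<h1><a name="hello">Hello</a></h1>\n<div class="colmask"></div>'

# Assumption: bs4 with the built-in 'html.parser' backend.
soup = BeautifulSoup(html, 'html.parser')
h1 = soup.find('h1')

sibling = h1.next_sibling

# The immediate sibling is the newline text node, not the <div> tag.
print(isinstance(sibling, NavigableString))
print(sibling == '\n')
```

Both checks should report that the sibling is a plain newline string, which is why a single nextSibling hop never reaches the <div>.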

Newlines also cause problems when you try to get, say, the third child of the following <div>:

<div>
  <div>hello</div>
  <div>world</div>
  <div>goodbye</div>
</div>

Here is how the children are numbered:

<div>\n #<---newline plus spaces at start of next line = child 0
  <div>hello</div>\n #<--newline plus spaces at start of next line = child 2
  <div>world</div>\n #<--newline plus spaces at start of next line = child 4
  <div>goodbye</div>\n #<--newline = child 6
</div>

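The inner <div>s are actually children number 1, 3, and 5. If you are having trouble navigating parsed HTML, there is a good chance that the newline at the end of each line is what is tripping you up. Newlines always have to be taken into account when you work out where things sit in the tree.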
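
The numbering can be verified with a short sketch. It assumes bs4 with the 'html.parser' backend (other parsers may wrap the fragment differently), and the names outer, tag_indexes, and third_div are made up for this illustration.

```python
from bs4 import BeautifulSoup, Tag

html = "<div>\n  <div>hello</div>\n  <div>world</div>\n  <div>goodbye</div>\n</div>"

# Assumption: bs4 with the built-in 'html.parser' backend.
soup = BeautifulSoup(html, 'html.parser')
outer = soup.find('div')

# .contents holds every direct child, whitespace strings included.
for index, child in enumerate(outer.contents):
    kind = 'tag' if isinstance(child, Tag) else 'text'
    print(index, kind, repr(str(child)))

# Indexes of the children that are real tags.
tag_indexes = [i for i, c in enumerate(outer.contents) if isinstance(c, Tag)]
print(tag_indexes)

# Asking for direct child tags by name skips the whitespace entirely.
third_div = outer.find_all('div', recursive=False)[2]
print(third_div.get_text())
```

Selecting child tags by name, as in the last step, avoids counting whitespace nodes by hand.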

To get the <div> tag here:

<h1><a name="hello">Hello</a></h1>\n #<---newline
<div class="colmask">

...you can write:

h1.nextSibling.nextSibling

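But to skip all of the whitespace between tags, it is easier to use findNextSibling(), which lets you specify the tag name of the next sibling you are looking for:

h1.findNextSibling('div')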
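
A small sketch comparing the two approaches. It assumes bs4, where the same calls are spelled next_sibling and find_next_sibling(), and the 'html.parser' backend; the second, newline-free document is a hypothetical variation added only to show how the two-hop version can break.

```python
from bs4 import BeautifulSoup

html = '<h1><a name="hello">Hello</a></h1>\n<div class="colmask"></div>'
soup = BeautifulSoup(html, 'html.parser')
h1 = soup.find('h1')

# Two hops: h1 -> newline string -> <div>. Depends on the exact whitespace.
by_hops = h1.next_sibling.next_sibling

# One call: skips any strings and returns the next sibling <div> tag.
by_name = h1.find_next_sibling('div')
print(by_hops is by_name)

# Hypothetical variation: the same markup with no newline between the tags.
compact = BeautifulSoup('<h1>Hello</h1><div class="colmask"></div>', 'html.parser')
compact_h1 = compact.find('h1')

# Now the two-hop version overshoots the <div> and finds nothing.
overshoot = compact_h1.next_sibling.next_sibling
still_found = compact_h1.find_next_sibling('div')
print(overshoot is None, still_found is not None)
```

Searching by tag name gives the same result whether or not the markup contains whitespace between the tags.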

Here is a complete example:

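# BeautifulSoup 3 import; with the newer bs4 package the import is
# `from bs4 import BeautifulSoup`.
from BeautifulSoup import BeautifulSoup

# data2.txt is assumed to hold the HTML from the question.
with open('data2.txt') as f:
    html = f.read()

soup = BeautifulSoup(html)

for h1 in soup.findAll('h1'):
    # Skip the newline after </h1> and go straight to the colmask <div>.
    colmask_div = h1.findNextSibling('div')

    # Each nested <div> is one box: an <h4> title followed by several <ul>s.
    for box_div in colmask_div.findAll('div'):
        h4 = box_div.find('h4')

        for ul in box_div.findAll('ul'):
            # Parenthesized, so it works as a print statement or a function call.
            print('{} : {} : {}'.format(h1.text, h4.text, ul.li.a.text))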



--output:--
Hello : My Favorite Number is : 1
Hello : My Favorite Number is : 2
Hello : My Favorite Number is : 3
Hello : My Favorite Number is : 4
Hello : Your Favorite Number is : 1
Hello : Your Favorite Number is : 2
Hello : Your Favorite Number is : 3
Hello : Your Favorite Number is : 4
Goodbye : Their Favorite Number is : 1
Goodbye : Their Favorite Number is : 2
Goodbye : Their Favorite Number is : 3
Goodbye : Their Favorite Number is : 4
Goodbye : Our Favorite Number is : 1
Goodbye : Our Favorite Number is : 2
Goodbye : Our Favorite Number is : 3
Goodbye : Our Favorite Number is : 4
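
For readers on Python 3, the same nested loop might look like the sketch below. It is not part of the original answer: it assumes the bs4 package with the 'html.parser' backend, the function name scrape_rows is hypothetical, and the sample is a shortened copy of the question's HTML.

```python
from bs4 import BeautifulSoup


def scrape_rows(html):
    """Return (h1 text, h4 text, number) tuples for every <ul> in every box."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    for h1 in soup.find_all('h1'):
        # Jump over the whitespace after </h1> to the colmask <div>.
        colmask_div = h1.find_next_sibling('div')
        if colmask_div is None:
            continue
        for box_div in colmask_div.find_all('div'):
            h4 = box_div.find('h4')
            if h4 is None:
                continue
            for ul in box_div.find_all('ul'):
                rows.append((h1.get_text(), h4.get_text(), ul.li.a.get_text()))
    return rows


# Shortened copy of the question's HTML, kept small for the example.
sample = """<h1><a name="hello">Hello</a></h1>
<div class="colmask">
<div class="box box_1">
<h4><a>My Favorite Number is</a></h4>
<ul><li><a>1</a></li></ul>
<ul><li><a>2</a></li></ul>
</div>
<div class="box box_2">
<h4><a>Your Favorite Number is</a></h4>
<ul><li><a>1</a></li></ul>
</div>
</div>
<h1 name="goodbye"><a>Goodbye</a></h1>
<div class="colmask">
<div class="box box_1">
<h4><a>Their Favorite Number is</a></h4>
<ul><li><a>1</a></li></ul>
</div>
</div>
"""

rows = scrape_rows(sample)
for row in rows:
    print(' : '.join(row))
```

Returning tuples instead of printing inside the loop keeps the traversal separate from the formatting, so the rows can be written to a file or checked in a test.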