How to use AND in findAll with BeautifulSoup

Asked: 2014-01-11 05:59:20

Tags: beautifulsoup web-crawler

I am trying to scrape this IP address using bs4. The IP here is 103.18.75.62:

<div class="the-ip"><label id="a829266">1</label><label id="a814974">0</label><span id="a968168">3</span><label id="d735847">.</label><span id="d111988">1</span><span id="b284407">8</span><span id="b740896">.</span><label id="d817182">7</label><label id="e268019">5</label><span id="a721115">.</span><label id="e816439">6</label><span id="b903319">2</span></div>

I expected the following to work:

ip_div = soup.findAll('div', class_='the-ip')
ips = ip_div[0].findAll('label' AND 'span')   # how to implement this AND ???
for i in ips:
    print i.get_text()

So how do I implement this AND?

1 answer:

Answer 0: (score: 1)

Use select with div.the-ip * as the CSS selector:

>>> from bs4 import BeautifulSoup
>>>
>>> soup = BeautifulSoup('''
... <div class="the-ip">
...     <label id="a829266">1</label>
...     <label id="a814974">0</label>
...     <span id="a968168">3</span>
...     <label id="d735847">.</label>
...     <span id="d111988">1</span>
...     <span id="b284407">8</span>
...     <span id="b740896">.</span>
...     <label id="d817182">7</label>
...     <label id="e268019">5</label>
...     <span id="a721115">.</span>
...     <label id="e816439">6</label>
...     <span id="b903319">2</span>
... </div>
... ''')
>>> ''.join(el.text for el in soup.select('div.the-ip *'))
u'103.18.75.62'

I thought div.the-ip>* (or div.the-ip>label, div.the-ip>span) should also work, but it does not work with bs4. (It works with lxml.)

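To answer the question "how to implement this AND":

You mean OR. A tag cannot be both a label and a span, so what you want is every tag that is either one.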
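One way to express that OR, as far as I know, is to pass a list of tag names to find_all (the newer spelling of findAll); bs4 then returns every tag whose name is in the list, in document order. The snippet below is a minimal sketch rather than part of the original answer: it is written for Python 3, names the html.parser backend explicitly, and drops the id attributes from the markup to keep it short.

```python
from bs4 import BeautifulSoup

# Same structure as the markup in the question, with the id attributes removed.
html = (
    '<div class="the-ip">'
    '<label>1</label><label>0</label><span>3</span><label>.</label>'
    '<span>1</span><span>8</span><span>.</span>'
    '<label>7</label><label>5</label><span>.</span>'
    '<label>6</label><span>2</span>'
    '</div>'
)

soup = BeautifulSoup(html, 'html.parser')
ip_div = soup.find('div', class_='the-ip')

# A list of names means "match any of these", i.e. label OR span.
ip = ''.join(el.get_text() for el in ip_div.find_all(['label', 'span']))
print(ip)
```

This should print the same address as in the question, 103.18.75.62.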

You can pass a compiled regular expression pattern instead of a string:

>>> import re
>>>
>>> ip_div = soup.find('div' , class_='the-ip') # `find`, not `findAll` here.
>>> ''.join(el.text for el in ip_div.findAll(re.compile('^(label|span)$')))
u'103.18.75.62'

^(label|span)$ matches label or span.
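The anchors in that pattern matter. As far as I can tell, bs4 tests a compiled pattern against the tag name with a search rather than a full match, so an unanchored pattern would also pick up names that merely contain "label" or "span". The sketch below uses only the re module to show the difference; the extra names (spanner, sublabel) are made up for illustration.

```python
import re

# Hypothetical tag names; only the first two should count as label or span.
tag_names = ['label', 'span', 'div', 'spanner', 'sublabel']

anchored = re.compile(r'^(label|span)$')
unanchored = re.compile(r'label|span')

# search() succeeds if the pattern matches anywhere in the string,
# so only the anchored pattern restricts the match to the whole name.
strict = [name for name in tag_names if anchored.search(name)]
loose = [name for name in tag_names if unanchored.search(name)]

print(strict)  # ['label', 'span']
print(loose)   # ['label', 'span', 'spanner', 'sublabel']
```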