Using Beautiful Soup to parse a child's value via its siblings and parent

Time: 2018-03-08 20:27:11

Tags: python html parsing

I am having trouble extracting the text I-want-ya from:

<div class="field">
   <div class="labelx"><a class="clickme" href="#h_group123" rel="#h_group123" title="Group">* Group</a></div>
   <div class="input">I-want-ya</div>
</div>

What I have so far:

import robobrowser
from bs4 import BeautifulSoup

# `url` is the address of the page being scraped
browser = robobrowser.RoboBrowser(parser='html.parser')
browser.open(url)
browser = browser.parsed
soup = BeautifulSoup(str(browser), 'html.parser')

parsed_value = soup.select('div.labelx + .input')

Is there any way to get the I-want-ya value:

  <div class="input">I-want-ya</div>

as the sibling of the div tag with class="labelx" that has a child a with the attribute title="Group"?

2 answers:

Answer 0 (score: 1)

Update: now handles multiple matches

from bs4 import BeautifulSoup

s = '''<div class="field">
   <div class="labelx"><a class="clickme" href="#h_group123" rel="#h_group123" title="Group">* Group</a></div>
   <div class="input">I-want-ya</div>
   <div class="labelx"><a class="clickme" href="#h_group123" rel="#h_group123" title="Group">* Group</a></div>
   <div class="input">I-want-you-2</div>
</div>'''

soup = BeautifulSoup(s, 'html.parser')

divs = soup.find_all('div', attrs={'class': 'labelx'})
for div in divs:
    # find() returns None (it does not raise) when there is no matching link
    if div.find('a', {'title': 'Group'}) is not None:
        print(div.find_next('div', {'class': 'input'}).text)
    else:
        print('No match.')

Output:

I-want-ya
I-want-you-2
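
Note that `find` returns `None` rather than raising when nothing matches, so filtering on the link's title needs an explicit check. Here is a minimal sketch of that filter. The helper name `values_for_title` is hypothetical, and the second label (title="Other") is made-up sample markup that is not part of the question:

```python
from bs4 import BeautifulSoup

# Sample markup; the second label (title="Other") is invented for illustration.
HTML = '''<div class="field">
   <div class="labelx"><a class="clickme" href="#h_group123" title="Group">* Group</a></div>
   <div class="input">I-want-ya</div>
   <div class="labelx"><a class="clickme" href="#h_other" title="Other">* Other</a></div>
   <div class="input">not-this-one</div>
</div>'''


def values_for_title(html, title):
    """Return the text of each div.input that follows a div.labelx whose link has `title`."""
    soup = BeautifulSoup(html, 'html.parser')
    values = []
    for label in soup.find_all('div', class_='labelx'):
        # find() returns None when no matching <a> exists; it does not raise.
        if label.find('a', title=title) is None:
            continue
        # Search only among the following siblings of the label div.
        value_div = label.find_next_sibling('div', class_='input')
        if value_div is not None:
            values.append(value_div.get_text(strip=True))
    return values


print(values_for_title(HTML, 'Group'))
```

Using `find_next_sibling` instead of `find_next` keeps the lookup at the same nesting level as the label, which is probably what you want for this layout.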

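Answer 1 (score: 0)

Assuming I understand correctly:

  • Find the div element that has the class you want.
  • Ask for all of its siblings, take the first one, then get that sibling's text.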
>>> HTML = '''\
... <div class="field">
...     <div class="labelx"><a class="clickme" href="#h_group123" rel="#h_group123" title="Group">* Group</a></div>
...     <div class="input">I-want-ya</div>
... </div>'''
>>> import bs4
>>> soup = bs4.BeautifulSoup(HTML, 'lxml')
>>> first_sib_div = soup.find('div', attrs={'class': 'labelx'})
>>> first_sib_div.find_next_siblings()[0].text
'I-want-ya'

Edit: this should be it.

>>> HTML = '''\
... <div class="field">
...     <div class="labelx"><a class="clickme" href="#h_group123" rel="#h_group123" title="Group">* Group</a></div>
...     <div class="input">I-want-ya</div>
... </div>'''
>>> import bs4
>>> soup = bs4.BeautifulSoup(HTML, 'lxml')
>>> first_div_link = soup.select('div.labelx > a[title="Group"]')[0]
>>> first_div_link.find_parent().find_next_siblings()[0].text
'I-want-ya'

Addendum: added in response to rahlf23's question.

>>> s = '''\
... <div class="field">
...     <div class="labelx"><a class="clickme" href="#h_group123" rel="#h_group123" title="Group">* Group</a></div>
...         <div class="input">I-want-ya</div>
...     <div class="labelx"><a class="clickme" href="#h_group123" rel="#h_group123" title="Group">* Group</a></div>     
...         <div class="input">I-want-ya-too</div>
... </div>'''
>>> soup = bs4.BeautifulSoup(s, 'lxml')
>>> for item in soup.select('div.labelx > a[title="Group"]'):
...     item.find_parent().find_next_siblings()[0].text
...
'I-want-ya'
'I-want-ya-too'
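
The parent-then-sibling walk can also be folded into a single CSS selector. This is a sketch, assuming a Beautiful Soup version (4.7 or later) whose `select` is backed by the soupsieve package, which supports the `:has()` pseudo-class. It uses `html.parser` so that lxml is not required:

```python
from bs4 import BeautifulSoup

HTML = '''<div class="field">
    <div class="labelx"><a class="clickme" href="#h_group123" rel="#h_group123" title="Group">* Group</a></div>
    <div class="input">I-want-ya</div>
    <div class="labelx"><a class="clickme" href="#h_group123" rel="#h_group123" title="Group">* Group</a></div>
    <div class="input">I-want-ya-too</div>
</div>'''

soup = BeautifulSoup(HTML, 'html.parser')

# Match a div.labelx that has a direct child <a title="Group">,
# then take the div.input that immediately follows it as a sibling.
selector = 'div.labelx:has(> a[title="Group"]) + div.input'
values = [tag.get_text(strip=True) for tag in soup.select(selector)]
print(values)
```

The `+` combinator only matches the element immediately after the label, so a label with no adjacent div.input simply contributes nothing to the result.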