Finding a tag with multiple attributes in BS4

Time: 2017-07-31 22:41:11

Tags: python python-3.x beautifulsoup tags bs4

How can I find a tag like this:

<us-applicant sequence="001" app-type="applicant" designation="us-only">

and, using BS4, match only those with us-applicant sequence="001" (so not us-applicant sequence="002")? I am familiar with finding tags that look more like:

<applicant>APPLICANTNAME</applicant>

I am looking for something that can handle this:

 <us-applicant sequence="001" app-type="applicant" designation="us-only">
  <some sub-tag>Data1</some sub-tag>
 <us-applicant sequence="001" app-type="applicant" designation="us-only">
  <some sub-tag>Data2</some sub-tag>
 <us-applicant sequence="002" app-type="applicant" designation="us-only">
  <some sub-tag>Data3</some sub-tag>

so that when I write

 vars = soup.find_all(SOMETHING)
 for var in vars:
     data = var.find('some sub-tag')
     print(data.text)

only Data1 and Data2 come back, not Data3.

1 Answer:

Answer 0: (score: 0)

Whitespace is not allowed in tag names, so I changed tags like <some sub-tag> by replacing the whitespace with a hyphen.
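To see why the whitespace matters, here is a small sketch of what happens to a tag written with a space in its name. It assumes Python's built-in html.parser backend (the session below uses lxml instead), and the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# Assumption: the built-in "html.parser" backend; other parsers may differ in detail.
snippet = "<some sub-tag>Data1</some sub-tag>"
soup = BeautifulSoup(snippet, "html.parser")

# Grab the first tag of any name and inspect how it was parsed.
first = soup.find(True)
parsed_name = first.name          # the tag name ends at the first space
parsed_attrs = dict(first.attrs)  # the rest is read as an attribute

# Searching for the name with a space in it finds nothing.
spaced_lookup = soup.find("some sub-tag")

print(parsed_name, parsed_attrs, spaced_lookup)
```

In other words, the part after the space is treated as an attribute rather than as part of the name, which is why a search for 'some sub-tag' cannot match.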

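soup.findAll returns the list of elements in HTML whose tag is 'us-applicant' and whose 'sequence' attribute is '002'. Since there is only one, we take element 0 of that list. We then ask that element for its child element with the tag 'some-sub-tag'. Finally, we display that element's text attribute. For the '001' case in the question, pass '001' instead and loop over the returned list, because more than one element matches.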

>>> HTML = '''
... <us-applicant sequence="001" app-type="applicant" designation="us-only">
...   <some-sub-tag>Data1</some-sub-tag>
...    <us-applicant sequence="001" app-type="applicant" designation="us-only">
...      <some-sub-tag>Data1</some-sub-tag>
...       <us-applicant sequence="002" app-type="applicant" designation="us-only">
...         <some-sub-tag>Data1</some-sub-tag>
...         '''
>>> import bs4
>>> soup = bs4.BeautifulSoup(HTML, 'lxml')
>>> one_applicant = soup.findAll('us-applicant', attrs={'sequence': '002'})[0]
>>> one_applicant 
<us-applicant app-type="applicant" designation="us-only" sequence="002">
<some-sub-tag>Data1</some-sub-tag>
</us-applicant>
>>> sub_tag = one_applicant.find('some-sub-tag')
>>> sub_tag
<some-sub-tag>Data1</some-sub-tag>
>>> sub_tag.text
'Data1'
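The session above selects sequence "002"; the question asks for every applicant with sequence "001". A minimal sketch of that case follows. It assumes each us-applicant element is properly closed and wrapped in a hypothetical us-applicants parent (neither is shown in the question), uses the built-in html.parser so that lxml is not required, and the helper name sub_tag_texts is made up for illustration.

```python
from bs4 import BeautifulSoup

# Assumption: closed tags and a wrapping <us-applicants> element, unlike the
# unclosed sample in the question.
DOC = """
<us-applicants>
  <us-applicant sequence="001" app-type="applicant" designation="us-only">
    <some-sub-tag>Data1</some-sub-tag>
  </us-applicant>
  <us-applicant sequence="001" app-type="applicant" designation="us-only">
    <some-sub-tag>Data2</some-sub-tag>
  </us-applicant>
  <us-applicant sequence="002" app-type="applicant" designation="us-only">
    <some-sub-tag>Data3</some-sub-tag>
  </us-applicant>
</us-applicants>
"""


def sub_tag_texts(markup, sequence):
    """Return the some-sub-tag text of every us-applicant with this sequence."""
    soup = BeautifulSoup(markup, "html.parser")
    # attrs takes a dict, so several attributes can be required at once,
    # including names such as app-type that are not valid Python keywords.
    applicants = soup.find_all(
        "us-applicant",
        attrs={"sequence": sequence, "app-type": "applicant"},
    )
    texts = []
    for applicant in applicants:
        sub_tag = applicant.find("some-sub-tag")
        if sub_tag is not None:  # skip applicants without the sub-tag
            texts.append(sub_tag.get_text(strip=True))
    return texts


results = sub_tag_texts(DOC, "001")
print(results)  # ['Data1', 'Data2']
```

Looping over the whole result list, rather than indexing [0], is what lets both "001" entries through while leaving Data3 out.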