How can I find tags like this:
<us-applicant sequence="001" app-type="applicant" designation="us-only">
using BS4, but only those with sequence="001" (so not us-applicant sequence="002")? I am familiar with finding tags that look more like this:
<applicant>APPLICANTNAME</applicant>
Given input like this:
<us-applicant sequence="001" app-type="applicant" designation="us-only">
<some sub-tag>Data1</some sub-tag>
<us-applicant sequence="001" app-type="applicant" designation="us-only">
<some sub-tag>Data2</some sub-tag>
<us-applicant sequence="002" app-type="applicant" designation="us-only">
<some sub-tag>Data3</some sub-tag>
I am looking for something to put in place of SOMETHING so that code like this:
vars = soup.find_all(SOMETHING)
for var in vars:
    data = var.find('some sub-tag')
    print(data.text)
returns only Data1 and Data2, and not Data3.
Answer 0: (score: 0)
Whitespace is not allowed in tag names, so I have changed tags like <some sub-tag> by replacing the spaces with hyphens.
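To illustrate why the rename matters, here is a minimal sketch, assuming Python's built-in html.parser backend rather than lxml. With a space in the tag, the parser reads the first word as the tag name and treats the rest as an attribute, so a search for 'some sub-tag' finds nothing.

```python
from bs4 import BeautifulSoup

# The original markup from the question, with a space inside the tag.
soup = BeautifulSoup("<some sub-tag>Data1</some sub-tag>", "html.parser")

# The element is parsed under the name before the space...
tag = soup.find("some")
print(tag.name)

# ...and the remainder becomes a valueless attribute on that element.
print("sub-tag" in tag.attrs)

# No element is literally named 'some sub-tag', so this lookup comes back empty.
print(soup.find("some sub-tag"))
```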
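soup.findAll (the older spelling of find_all) returns a list of the elements in the HTML whose tag is 'us-applicant' and whose 'sequence' attribute is '002'. Since there is only one, we select element 0 of this list. We then ask that element for its child with the tag 'some-sub-tag'. Finally, we display that child's text property.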
>>> HTML = '''
... <us-applicant sequence="001" app-type="applicant" designation="us-only">
... <some-sub-tag>Data1</some-sub-tag>
... <us-applicant sequence="001" app-type="applicant" designation="us-only">
... <some-sub-tag>Data1</some-sub-tag>
... <us-applicant sequence="002" app-type="applicant" designation="us-only">
... <some-sub-tag>Data1</some-sub-tag>
... '''
>>> import bs4
>>> soup = bs4.BeautifulSoup(HTML, 'lxml')
>>> one_applicant = soup.findAll('us-applicant', attrs={'sequence': '002'})[0]
>>> one_applicant
<us-applicant app-type="applicant" designation="us-only" sequence="002">
<some-sub-tag>Data1</some-sub-tag>
</us-applicant>
>>> sub_tag = one_applicant.find('some-sub-tag')
>>> sub_tag
<some-sub-tag>Data1</some-sub-tag>
>>> sub_tag.text
'Data1'
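The session above selects sequence '002'. For the case in the question (every applicant with sequence '001'), the same idea can be sketched as a loop. This is only a sketch under some assumptions: the us-applicant elements are properly closed, the wrapping applicants element and the helper name sub_tag_texts are made up for illustration, and html.parser is used so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Sample data modelled on the question; the <applicants> wrapper and the
# closing tags are assumptions added so that the elements do not nest.
XML = """
<applicants>
  <us-applicant sequence="001" app-type="applicant" designation="us-only">
    <some-sub-tag>Data1</some-sub-tag>
  </us-applicant>
  <us-applicant sequence="001" app-type="applicant" designation="us-only">
    <some-sub-tag>Data2</some-sub-tag>
  </us-applicant>
  <us-applicant sequence="002" app-type="applicant" designation="us-only">
    <some-sub-tag>Data3</some-sub-tag>
  </us-applicant>
</applicants>
"""

soup = BeautifulSoup(XML, "html.parser")


def sub_tag_texts(soup, sequence):
    """Collect the some-sub-tag text of every us-applicant with this sequence."""
    results = []
    for applicant in soup.find_all("us-applicant", attrs={"sequence": sequence}):
        sub_tag = applicant.find("some-sub-tag")
        if sub_tag is not None:  # skip applicants that lack the sub-tag
            results.append(sub_tag.get_text(strip=True))
    return results


print(sub_tag_texts(soup, "001"))  # ['Data1', 'Data2']

# The same filter written as a CSS attribute selector.
css_texts = [t.get_text(strip=True)
             for t in soup.select('us-applicant[sequence="001"] some-sub-tag')]
print(css_texts)  # ['Data1', 'Data2']
```

Filtering on the attribute inside find_all keeps the loop body simple: each applicant it yields already has the wanted sequence value, so Data3 is never visited.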