Question

I have XML files with a defined structure but a varying number of tags, for example

file1.xml:

<document>
  <subDoc>
    <id>1</id>
    <myId>1</myId>
  </subDoc>
</document>

file2.xml:

<document>
  <subDoc>
    <id>2</id>
  </subDoc>
</document>

Now I want to check whether the tag myId exists. So I did the following:

data = open("file1.xml",'r').read()
xml = BeautifulSoup(data)

hasAttrBs = xml.document.subdoc.has_attr('myID')
hasAttrPy = hasattr(xml.document.subdoc,'myID')
hasType = type(xml.document.subdoc.myid)

The results for file1.xml:

hasAttrBs -> False
hasAttrPy -> True
hasType ->   <class 'bs4.element.Tag'>

file2.xml:

hasAttrBs -> False
hasAttrPy -> True
hasType -> <type 'NoneType'>

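OK, so <myId> is not an attribute of <subdoc>.

But how can I test whether a child tag exists?

//Edit: By the way, I would rather not iterate over the whole subdoc, because that would be very slow. I am hoping for a way to address or query that element directly.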

Answer 1

if tag.find('child_tag_name'):
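To make the one-liner above concrete, here is a minimal sketch applied to the two documents from the question. The helper name has_child_tag and the inline markup strings are assumptions made for illustration, and html.parser is chosen only so the example has no extra dependencies. That parser lowercases tag names, so the lookups use "subdoc" and "myid".

```python
from bs4 import BeautifulSoup

# Inline copies of file1.xml and file2.xml from the question.
FILE1 = "<document><subDoc><id>1</id><myId>1</myId></subDoc></document>"
FILE2 = "<document><subDoc><id>2</id></subDoc></document>"

def has_child_tag(markup, parent_name, child_name):
    """Return True if a child_name tag exists somewhere below parent_name."""
    soup = BeautifulSoup(markup, "html.parser")
    parent = soup.find(parent_name)
    if parent is None:
        return False
    # find() returns the first matching Tag, or None when nothing matches.
    return parent.find(child_name) is not None

print(has_child_tag(FILE1, "subdoc", "myid"))
print(has_child_tag(FILE2, "subdoc", "myid"))
```

Comparing against None spells out what the plain `if tag.find(...)` test relies on: find() yields None when there is no match.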

Answer 2

If you don't know the structure of the XML document, you can use the soup's .find() method. Something like this:

with open("file1.xml", 'r') as data, open("file2.xml", 'r') as data2:
    xml = BeautifulSoup(data.read())
    xml2 = BeautifulSoup(data2.read())

    # Search for the lowercase name: the HTML parsers lowercase tag names.
    hasAttrBs = xml.find("myid")
    hasAttrBs2 = xml2.find("myid")

If you do know the structure, you can get the desired element by accessing the tag name as an attribute, like xml.document.subdoc.myid. So the whole thing would look something like this:

with open("file1.xml", 'r') as data, open("file2.xml", 'r') as data2:
    xml = BeautifulSoup(data.read())
    xml2 = BeautifulSoup(data2.read())

    hasAttrBs = xml.document.subdoc.myid
    hasAttrBs2 = xml2.document.subdoc.myid
    print(hasAttrBs)
    print(hasAttrBs2)

This prints the <myid> tag for file1.xml and None for file2.xml.
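One detail worth checking when trying these lookups is the case of the tag name. The sketch below assumes an HTML parser is in use (html.parser here, an assumption made so the example needs no extra install). It shows that the mixed-case names from the file are stored in lowercase, which is why the attribute access in the question is written subdoc and myid.

```python
from bs4 import BeautifulSoup

markup = "<document><subDoc><id>1</id><myId>1</myId></subDoc></document>"
soup = BeautifulSoup(markup, "html.parser")

# Collect the names of all tags as the parser stored them.
names = [tag.name for tag in soup.find_all(True)]
print(names)

# A lookup by name is case-sensitive against the stored (lowercase) name.
found_lower = soup.find("myid") is not None
found_mixed = soup.find("myId") is not None
print(found_lower, found_mixed)
```

If the original case has to be preserved, parsing with an XML parser is the usual alternative, but the attribute-style access then has to use the exact names from the file.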

Answer 3

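You can handle it like this:

def has_child(parent, name):
    # Scan only the direct children of parent for a tag with this name.
    for child in parent.children:
        if child.name == name:
            return True
    return False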

Answer 4

Here is an example that checks whether an h2 tag exists in the page at an Instagram URL. Hope you find it useful:

import datetime
import urllib
import requests
from bs4 import BeautifulSoup

instagram_url = 'https://www.instagram.com/p/BHijrYFgX2v/?taken-by=findingmero'
html_source = requests.get(instagram_url).text
soup = BeautifulSoup(html_source, "lxml")

if not soup.find('h2'):
    print("didn't find h2")

Answer 5

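You can check with if tag.myID:

If you want to check whether myID is a direct child, rather than a child of a child, use if tag.find("myID", recursive=False):

If you want to check whether the tag has no child tags at all, use if not tag.find(True):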
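The three checks above can be sketched side by side. The markup, with its group and leaf tags, is hypothetical and exists only to place myid one level deeper than a direct child. html.parser is an assumption chosen to avoid extra dependencies.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: <myid> sits inside <group>, not directly in <subdoc>.
markup = (
    "<document>"
    "<subdoc><id>1</id><group><myid>1</myid></group></subdoc>"
    "<leaf></leaf>"
    "</document>"
)
soup = BeautifulSoup(markup, "html.parser")
subdoc = soup.find("subdoc")
leaf = soup.find("leaf")

# Attribute access searches all descendants, so the nested tag is found.
anywhere = subdoc.myid is not None

# recursive=False restricts the search to direct children only.
direct = subdoc.find("myid", recursive=False) is not None

# find(True) matches any tag; None means there are no child tags at all.
leaf_has_children = leaf.find(True) is not None

print(anywhere, direct, leaf_has_children)
```

The direct-child form is also the cheapest of the three, since it never descends past the first level.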

Answer 6

import requests
from bs4 import BeautifulSoup

page = requests.get("http://dataquestio.github.io/web-scraping-pages/simple.html")
soup = BeautifulSoup(page.content, 'html.parser')
testNode = list(soup.children)[1]

def hasChild(node):
    # Relies on .children raising AttributeError for nodes that
    # cannot contain children (for example, text nodes).
    print(type(node))
    try:
        node.children
        return True
    except AttributeError:
        return False

if hasChild(testNode):
    firstChild = list(testNode.children)[0]
    if hasChild(firstChild):
        print('I found Grand Child ')

Testing whether a child tag exists in BeautifulSoup
