I have XML files that share a defined structure but contain a varying number of tags, for example
file1.xml:
<document>
<subDoc>
<id>1</id>
<myId>1</myId>
</subDoc>
</document>
file2.xml:
<document>
<subDoc>
<id>2</id>
</subDoc>
</document>
Now I want to check whether the tag myId exists. So I did the following:
data = open("file1.xml",'r').read()
xml = BeautifulSoup(data)
hasAttrBs = xml.document.subdoc.has_attr('myID')
hasAttrPy = hasattr(xml.document.subdoc,'myID')
hasType = type(xml.document.subdoc.myid)
The results for file1.xml:
hasAttrBs -> False
hasAttrPy -> True
hasType -> <class 'bs4.element.Tag'>
file2.xml:
hasAttrBs -> False
hasAttrPy -> True
hasType -> <type 'NoneType'>
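OK, <myId> is not an attribute of <subdoc>.
But how can I test whether a child tag exists?
//Edit: By the way, I would rather not iterate over the whole subdoc, because that would be very slow. I am hoping for a way to address or query that element directly.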
Answer 0 (score: 18)
if tag.find('child_tag_name'):
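A minimal sketch of this check against the two documents from the question, assuming the built-in html.parser is used. That parser lowercases tag names, so the child is looked up as myid. The helper name has_child and the inline FILE1/FILE2 strings are hypothetical stand-ins for the real files:

```python
from bs4 import BeautifulSoup

# Inline copies of file1.xml and file2.xml from the question
FILE1 = "<document><subDoc><id>1</id><myId>1</myId></subDoc></document>"
FILE2 = "<document><subDoc><id>2</id></subDoc></document>"


def has_child(tag, name):
    # find() returns the first matching descendant Tag, or None if there is none
    return tag.find(name) is not None


soup1 = BeautifulSoup(FILE1, "html.parser")
soup2 = BeautifulSoup(FILE2, "html.parser")

# html.parser stores tag names in lowercase, so search for "myid"
result1 = has_child(soup1.document.subdoc, "myid")
result2 = has_child(soup2.document.subdoc, "myid")
print(result1, result2)
```

Comparing against None makes the intent explicit, although a plain truth test on the result of find() behaves the same way here.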
Answer 1 (score: 4)
If you don't know the structure of the XML document, you can use the soup's .find() method. Something like this:
from bs4 import BeautifulSoup

with open("file1.xml", 'r') as data, open("file2.xml", 'r') as data2:
    xml = BeautifulSoup(data.read(), "html.parser")
    xml2 = BeautifulSoup(data2.read(), "html.parser")
    # HTML parsers lowercase tag names, so search for "myid" rather than "myId"
    hasAttrBs = xml.find("myid")
    hasAttrBs2 = xml2.find("myid")
If you do know the structure, you can get the desired element by accessing the tag name as an attribute, like xml.document.subdoc.myid. So the whole thing would look like this:
with open("file1.xml", 'r') as data, open("file2.xml", 'r') as data2:
    xml = BeautifulSoup(data.read(), "html.parser")
    xml2 = BeautifulSoup(data2.read(), "html.parser")
    hasAttrBs = xml.document.subdoc.myid
    hasAttrBs2 = xml2.document.subdoc.myid
    print(hasAttrBs)
    print(hasAttrBs2)
This prints the <myid> tag for file1.xml and None for file2.xml.
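Two caveats with these lookups can be sketched as follows, assuming the built-in html.parser and an inline copy of file2.xml. The tag name nosuch is hypothetical and only stands for an element that is absent:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<document><subDoc><id>2</id></subDoc></document>", "html.parser"
)

# Attribute access is shorthand for find(): a missing tag yields None
missing = soup.document.subdoc.myid

# The parser stored the tag as "subdoc", so the mixed-case name does not match
mixed_case = soup.find("subDoc")
lower_case = soup.find("subdoc")

# Chaining through a missing tag fails, because None has no attributes
try:
    soup.document.nosuch.id
    chained_ok = True
except AttributeError:
    chained_ok = False

print(missing, mixed_case, lower_case.name, chained_ok)
```

So the attribute chain is only safe when every intermediate tag is known to exist; otherwise a single find() call on the soup is the more forgiving option.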
Answer 2 (score: 1)
def has_child(parent, name):
    for child in parent.children:
        if child.name == name:
            return True
    return False
# e.g. has_child(xml.document.subdoc, 'myid')
Answer 3 (score: 1)
Here is an example that checks whether an h2 tag exists on an Instagram page. Hope you find it useful:
import requests
from bs4 import BeautifulSoup

instagram_url = 'https://www.instagram.com/p/BHijrYFgX2v/?taken-by=findingmero'
html_source = requests.get(instagram_url).text
soup = BeautifulSoup(html_source, "lxml")

# find() returns None when the page contains no <h2> tag
if not soup.find('h2'):
    print("didn't find h2")
Answer 4 (score: 1)
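You can check it like this:
if tag.myID:
If you want to check that myID is a direct child rather than a child of a child, use:
if tag.find("myID", recursive=False):
tag.find(True) matches any child tag, so to check whether the tag has no children, use:
if not tag.find(True):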
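The difference between these checks can be sketched on a small made-up document, assuming the built-in html.parser (hence the lowercase names). The tags wrapper and empty are hypothetical and exist only to create a nested match and a childless tag:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<subdoc><id>1</id><wrapper><myid>1</myid></wrapper><empty></empty></subdoc>",
    "html.parser",
)
subdoc = soup.subdoc

# Default search is recursive, so the nested <myid> is found
anywhere = subdoc.find("myid") is not None

# recursive=False restricts the search to direct children of <subdoc>
direct = subdoc.find("myid", recursive=False) is not None

# find(True) matches any tag, so it reports whether child tags exist at all
has_children = subdoc.find(True) is not None
empty_has_children = soup.find("empty").find(True) is not None

print(anywhere, direct, has_children, empty_has_children)
```

Note that find(True) only looks at tags; a tag that contains nothing but text would still count as having no child tags.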
Answer 5 (score: 0)
import requests
from bs4 import BeautifulSoup

page = requests.get("http://dataquestio.github.io/web-scraping-pages/simple.html")
soup = BeautifulSoup(page.content, 'html.parser')
testNode = list(soup.children)[1]

def hasChild(node):
    # Treat any node that lacks a .children attribute as childless
    print(type(node))
    try:
        node.children
        return True
    except AttributeError:
        return False

if hasChild(testNode):
    firstChild = list(testNode.children)[0]
    if hasChild(firstChild):
        print('I found Grand Child ')