Question

<html>
    <body>
        <div class="root-div">
            <h1>This is H1</h1>
            <ul>UL</ul>
            <h2>This is H2</h2>
            <img src="www.ttttt.com.png">
            <ul>UL</ul>
            <a href="www.ttttt.com"></a>
            <h3>This is H3</h3>
        </div>
    </body>
</html>

If I know all the tags, I can get all the information.

response.css('div.root-div > h1::text').extract_first()
response.css('div.root-div > h2::text').extract_first()
response.css('div.root-div > a::attr(href)').extract_first()

If I don't know which tags are inside <div class="root-div">??????</div>, how can I get the text of each one?

For example:

for tag in response.css('div.root-div ??????????'):
    if tag == "div":
        print("do something")
    elif tag == "img":
        print("do something")
    else:
        print("")

Answer 1

If you need to know the tag of each child element, do the following:

for item in response.css('div.root-div *'):
    tag = item.root.tag  # tag name of the underlying lxml element, e.g. 'h1'
    if tag == 'div':
        pass  # ...

However, if you really just want the text of the child elements, do the following:

for text in response.css('div.root-div ::text').getall():
    pass  # ...

Scrapy: for loop over a node's child nodes
