Question

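I am using lxml to scrape a particular page. I know how to select tags by id, but I can't figure out how to get the actual id attributes.

For example, say the HTML is: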

<div id="stuff" >
    <div id="some unknown"> xxxx </div>
    <div id="another unknown"> xxxxx </div>
</div>

How do I get the list

['some unknown', 'another unknown']

Is there a way to do this specifically with XPath?

Answer 1

If you want the ids of the direct children, you can use the following XPath query:

#                                       v obtain id attribute
document.xpath('//*[@id="stuff"]/*[@id]/@id')
#                 ^ #stuff tag   ^ child with id attribute

Here we first look for a tag <* id="stuff">, then we look for its direct children (of any tag) that have an @id, and finally obtain the @id of each of these.

This returns a list of lxml.etree._ElementUnicodeResult elements. We can, however, use str(..) to obtain the plain string values:

[str(the_id) for the_id in document.xpath('//*[@id="stuff"]/*[@id]/@id')]
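To see the query and the str(..) conversion working together, here is a minimal sketch run against the markup from the question. It assumes the page is parsed with lxml.html.fromstring; the variable names raw_ids and ids are only illustrative.

```python
from lxml import html

# The markup from the question, parsed into an element tree.
document = html.fromstring("""
<div id="stuff">
    <div id="some unknown"> xxxx </div>
    <div id="another unknown"> xxxxx </div>
</div>
""")

# Select the id attribute of every direct child of #stuff that has one.
raw_ids = document.xpath('//*[@id="stuff"]/*[@id]/@id')

# Convert lxml's string-like results into plain str objects.
ids = [str(the_id) for the_id in raw_ids]
print(ids)  # ['some unknown', 'another unknown']
```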

Note that here we pay no attention to the type of the children. If you only want <div> children, you can use:

#                                         v obtain id attribute
document.xpath('//*[@id="stuff"]/div[@id]/@id')
#                 ^ #stuff tag   ^ child with id attribute
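The difference between the any-tag query and the div-only query only shows up when the children are of mixed types. The sketch below uses hypothetical markup (a <span> child and a <div> without an id are added for illustration; they are not part of the question).

```python
from lxml import html

# Hypothetical markup: #stuff has children of different tag types,
# and one child that carries no id at all.
document = html.fromstring("""
<div id="stuff">
    <div id="first"> a </div>
    <span id="second"> b </span>
    <div> no id here </div>
</div>
""")

# Any direct child that has an id attribute.
any_tag = [str(i) for i in document.xpath('//*[@id="stuff"]/*[@id]/@id')]

# Only direct <div> children that have an id attribute.
div_only = [str(i) for i in document.xpath('//*[@id="stuff"]/div[@id]/@id')]

print(any_tag)   # ['first', 'second']
print(div_only)  # ['first']
```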

If you are looking for all descendants, you only need to add an extra slash between the @id="stuff" query and the children:

#                                        v obtain id attribute
document.xpath('//*[@id="stuff"]//*[@id]/@id')
#                 ^ #stuff tag    ^ descendant with id attribute
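To make the child versus descendant distinction concrete, the following sketch runs both queries on hypothetical nested markup (the ids outer, inner and sibling are made up for illustration). Note that #stuff itself is not included in either result.

```python
from lxml import html

# Hypothetical markup with one level of extra nesting below #stuff.
document = html.fromstring("""
<div id="stuff">
    <div id="outer">
        <div id="inner"> x </div>
    </div>
    <div id="sibling"> y </div>
</div>
""")

# Single slash: direct children of #stuff only.
children = [str(i) for i in document.xpath('//*[@id="stuff"]/*[@id]/@id')]

# Double slash: every descendant of #stuff, in document order.
descendants = [str(i) for i in document.xpath('//*[@id="stuff"]//*[@id]/@id')]

print(children)     # ['outer', 'sibling']
print(descendants)  # ['outer', 'inner', 'sibling']
```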
