Question

I am trying to extract image:title tags that contain specific keywords from an XML page. The keywords work fine if I search only on the loc tag. Here is the code:

        print("Searching for product...")
        keywordLinkFound = False
        while keywordLinkFound is False:
            html = self.driver.page_source
            soup = BeautifulSoup(html, 'xml')
            try:
                regexp = "%s.*%s|%s.%s" % (keyword1, keyword2, keyword2, keyword1)
                keywordLink = soup.find('image:title', text=re.compile(regexp))
                print(keywordLink)
                return keywordLink
            except AttributeError:
                print("Product not found on site, retrying...")
                time.sleep(monitorDelay)
                self.driver.refresh()
            break

This is the XML that is being parsed on the fly:

<url>
<loc>
   https://packershoes.com/products/copy-of-adidas-predator-accelerator-trainer
</loc>
<lastmod>2018-11-24T08:22:42-05:00</lastmod>
<changefreq>daily</changefreq>
<image:image>
    <image:loc>
    https://cdn.shopify.com/s/files/1/0208/5268/products/adidas_Yung-1_B37616_side.jpg?v=1537395620
    </image:loc>
    <image:title>ADIDAS YUNG-1 "CLOUD WHITE"</image:title>
</image:image>
</url>

It seems I cannot access the image:title tag.

Answer 1

This finds the text inside <image:title>:

soup.find_all('image')[0].find_all('title')[0].text

Or you can use:

soup.image.title.text

Output:

'ADIDAS YUNG-1 "CLOUD WHITE"'

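You should use BeautifulSoup's built-in methods (see its documentation) instead of regular expressions. The advantage of parsing markup with BeautifulSoup is that you can make use of the document's structure.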
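As an illustration of matching keywords without a regular expression, here is a minimal sketch of a plain-Python check that every keyword occurs in a title, regardless of order or case. The function name `title_matches` and the keyword lists are hypothetical and not part of the original code; a predicate like this could be applied to the text returned by the lookups above.

```python
def title_matches(title, keywords):
    """Return True when every keyword occurs in the title, ignoring case and order."""
    lowered = title.lower()
    return all(keyword.lower() in lowered for keyword in keywords)


# Sample title taken from the XML in the question.
title = 'ADIDAS YUNG-1 "CLOUD WHITE"'

# Both keywords are present, in the opposite order from the title.
print(title_matches(title, ["yung", "adidas"]))

# "predator" does not occur in the title.
print(title_matches(title, ["adidas", "predator"]))
```

Lowercasing both sides matters here because the titles in the sitemap are upper case, while keywords typed by a user often are not.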

Edit

Here is the complete working code:

from bs4 import BeautifulSoup

html = """
<url>
<loc>
   https://packershoes.com/products/copy-of-adidas-predator-accelerator-trainer
</loc>
<lastmod>2018-11-24T08:22:42-05:00</lastmod>
<changefreq>daily</changefreq>
<image:image>
    <image:loc>
    https://cdn.shopify.com/s/files/1/0208/5268/products/adidas_Yung-1_B37616_side.jpg?v=1537395620
    </image:loc>
    <image:title>ADIDAS YUNG-1 "CLOUD WHITE"</image:title>
</image:image>
</url>
"""

soup = BeautifulSoup(html, 'xml')
soup.image.title.text

Output:

'ADIDAS YUNG-1 "CLOUD WHITE"'
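For comparison, the same lookup can be sketched with the standard library's `xml.etree.ElementTree` instead of BeautifulSoup. This is a different technique from the one in the answer, and it rests on an assumption: that the real sitemap wraps its entries in a `<urlset>` element declaring the sitemap and image namespaces, as shown in the sample below. The wrapper element, the `find_product_url` helper, and the keyword lists are illustrative and not taken from the original post.

```python
import xml.etree.ElementTree as ET

# Assumed sample: the entry from the question, wrapped in a <urlset>
# that declares the namespaces (the snippet in the question omits them).
SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
  <url>
    <loc>
      https://packershoes.com/products/copy-of-adidas-predator-accelerator-trainer
    </loc>
    <image:image>
      <image:title>ADIDAS YUNG-1 "CLOUD WHITE"</image:title>
    </image:image>
  </url>
</urlset>
"""

# Prefix-to-URI map used by the path expressions below.
NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "image": "http://www.google.com/schemas/sitemap-image/1.1",
}


def find_product_url(xml_text, keywords):
    """Return the <loc> of the first entry whose image title has every keyword."""
    root = ET.fromstring(xml_text)
    for url in root.findall("sm:url", NS):
        title = url.findtext("image:image/image:title", default="", namespaces=NS)
        if all(keyword.lower() in title.lower() for keyword in keywords):
            return url.findtext("sm:loc", default="", namespaces=NS).strip()
    return None  # no entry matched


print(find_product_url(SITEMAP, ["adidas", "yung"]))
```

Returning `None` when nothing matches lets the caller decide whether to wait and retry, rather than relying on an exception being raised.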

Unable to find XML tags using Beautiful Soup and Python