I'm trying to extract image:title tags that contain specific keywords from an XML page. The keywords work fine if I search only on the loc tags. Here is the code:
print("Searching for product...")
keywordLinkFound = False
while keywordLinkFound is False:
    html = self.driver.page_source
    soup = BeautifulSoup(html, 'xml')
    try:
        regexp = "%s.*%s|%s.%s" % (keyword1, keyword2, keyword2, keyword1)
        keywordLink = soup.find('image:title', text=re.compile(regexp))
        print(keywordLink)
        return keywordLink
    except AttributeError:
        print("Product not found on site, retrying...")
        time.sleep(monitorDelay)
        self.driver.refresh()
        break
This is the XML being parsed:
<url>
  <loc>
    https://packershoes.com/products/copy-of-adidas-predator-accelerator-trainer
  </loc>
  <lastmod>2018-11-24T08:22:42-05:00</lastmod>
  <changefreq>daily</changefreq>
  <image:image>
    <image:loc>
      https://cdn.shopify.com/s/files/1/0208/5268/products/adidas_Yung-1_B37616_side.jpg?v=1537395620
    </image:loc>
    <image:title>ADIDAS YUNG-1 "CLOUD WHITE"</image:title>
  </image:image>
</url>
It seems I can't access the image:title tag.
Answer 0 (score: 0)
This finds the text inside <image:title>:
soup.findAll('image')[0].findAll('title')[0].text
Or you can use:
soup.image.title.text
Output:
'ADIDAS YUNG-1 "CLOUD WHITE"'
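You should use BeautifulSoup's built-in methods (documentation) instead of regular expressions to locate the tag. The advantage of parsing HTML with BeautifulSoup is that you can take advantage of the structure of the markup.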
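Once the titles are reachable through the tree, the keyword check from the question can run on plain strings. The following is only a minimal sketch, assuming both keywords should match in either order and ignoring case. `title_matches` is a hypothetical helper name that does not appear in the question, and the `titles` list stands in for the text collected from the title tags.

```python
import re

def title_matches(title, keyword1, keyword2):
    """Return True if both keywords occur in title, in either order."""
    # re.escape keeps characters such as '-' or '.' in a keyword literal.
    k1, k2 = re.escape(keyword1), re.escape(keyword2)
    pattern = re.compile("%s.*%s|%s.*%s" % (k1, k2, k2, k1), re.IGNORECASE)
    return pattern.search(title) is not None

# Stand-in data; in practice these strings would come from the parsed tags.
titles = ['ADIDAS YUNG-1 "CLOUD WHITE"', 'NIKE AIR MAX 1']
matches = [t for t in titles if title_matches(t, "yung", "adidas")]
print(matches)
```

Note that both alternatives use `.*` between the keywords, so any amount of text may separate them regardless of order.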
Edit
Here is the complete working code:
from bs4 import BeautifulSoup
html = """
<url>
<loc>
https://packershoes.com/products/copy-of-adidas-predator-accelerator-trainer
</loc>
<lastmod>2018-11-24T08:22:42-05:00</lastmod>
<changefreq>daily</changefreq>
<image:image>
<image:loc>
https://cdn.shopify.com/s/files/1/0208/5268/products/adidas_Yung-1_B37616_side.jpg?v=1537395620
</image:loc>
<image:title>ADIDAS YUNG-1 "CLOUD WHITE"</image:title>
</image:image>
</url>
"""
soup = BeautifulSoup(html, 'xml')
soup.image.title.text
Output:
'ADIDAS YUNG-1 "CLOUD WHITE"'
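A note on the retry loop in the question: `soup.find` returns `None` when nothing matches rather than raising `AttributeError`, so the `except` branch there is never reached and the function simply returns `None`. Below is a rough sketch of a polling loop that checks the result explicitly. The names `wait_for_title`, `fetch_titles`, and `is_match` are hypothetical: `fetch_titles` stands in for refreshing the page and collecting the title strings, and `is_match` for whatever keyword test is used.

```python
import time

def wait_for_title(fetch_titles, is_match, delay=1.0, max_attempts=5):
    """Poll fetch_titles() until a title satisfies is_match, or give up."""
    for attempt in range(max_attempts):
        for title in fetch_titles():
            if is_match(title):
                return title
        print("Product not found on site, retrying...")
        # Skip the pause after the final attempt.
        if attempt < max_attempts - 1:
            time.sleep(delay)
    return None  # nothing matched within max_attempts

# Simulated page loads: the product only shows up on the third fetch.
pages = iter([[], ['NIKE AIR MAX 1'], ['ADIDAS YUNG-1 "CLOUD WHITE"']])
found = wait_for_title(lambda: next(pages), lambda t: "YUNG" in t, delay=0)
print(found)
```

Bounding the loop with `max_attempts` is a design choice of this sketch, not something the question asks for; it keeps the monitor from spinning forever if the product never appears.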