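I am parsing XML and want to get the image inside the description tag. I am using urllib and BeautifulSoup. I can get images that sit in their own tags, but I cannot get the image inside the description tag, where it is stored in encoded form.
XML code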
<item>
<title>Kidnapped NDC member and political activist tells his story</title>
<link>http://www.yementimes.com/en/1724/news/3065</link>
<description><img src="http://www.yementimes.com/images/thumbnails/cms-thumb-000003081.jpg" border="0" align="left" hspace="5" />
‘I kept telling them that they would never break me and that the change we demanded in 2011 would come whether they wanted it or not’
<br clear="all"></description>
views.py
for q in b.findAll('item'):
    d={}
    d['desc']=strip_tags(q.description.string).strip(' ')
    if q.guid:
        d['link']=q.guid.string
    else:
        d['link']=strip_tags(q.comments)
    d['title']=q.title.string
    for r in q.findAll('enclosure'):
        d['image']=r['url']
    arr.append(d)
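Can anyone give me an idea?
The code above is what I use to parse images that are in their own tags.
When the image is inside the description, I have tried to get it but could not.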
Answer 0 (score: 0)
You can try extracting everything from <description>, creating a new BeautifulSoup object from it, and looking up the src attribute of the first <img> element:
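import html
import sys

from bs4 import BeautifulSoup

# Parse the feed file named on the command line.
with open(sys.argv[1], 'r') as f:
    soup = BeautifulSoup(f, 'html.parser')

for i in soup.find_all('item'):
    # html.unescape replaces HTMLParser.unescape, which was removed in Python 3.9.
    d = BeautifulSoup(html.unescape(i.description.string), 'html.parser')
    print(d.img['src'])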
Run it like this:
python3 script.py xmlfile
Output:
http://www.yementimes.com/images/thumbnails/cms-thumb-000003081.jpg
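
If you want the same idea inside a loop that builds one dict per item, as in the question, something like the following might work. It is only a sketch: the FEED string is a made-up sample that assumes the description body is entity-encoded, and description_image and parse_items are hypothetical helper names, not part of BeautifulSoup. It also returns None instead of raising when an item has no image.

```python
import html

from bs4 import BeautifulSoup

# Hypothetical sample feed; the <description> body is assumed to be
# entity-encoded HTML, as described in the question.
FEED = """
<rss><channel>
<item>
<title>Kidnapped NDC member and political activist tells his story</title>
<description>&lt;img src="http://www.yementimes.com/images/thumbnails/cms-thumb-000003081.jpg" border="0" align="left" hspace="5" /&gt;
Some summary text
&lt;br clear="all"&gt;</description>
</item>
<item>
<title>Item without an image</title>
<description>Plain text only</description>
</item>
</channel></rss>
"""


def description_image(item):
    """Return the src of the first <img> in an item's description, or None."""
    if item.description is None:
        return None
    # get_text() joins all text nodes, so it also works when .string is None.
    raw = item.description.get_text()
    # Unescape any remaining entities, then parse the fragment as HTML.
    inner = BeautifulSoup(html.unescape(raw), "html.parser")
    img = inner.find("img")
    return img.get("src") if img else None


def parse_items(feed_text):
    """Build one dict per <item> with its title and description image."""
    soup = BeautifulSoup(feed_text, "html.parser")
    items = []
    for item in soup.find_all("item"):
        items.append({
            "title": item.title.string,
            "image": description_image(item),
        })
    return items


if __name__ == "__main__":
    for entry in parse_items(FEED):
        print(entry["title"], "->", entry["image"])
```

In the views.py loop from the question, the equivalent step would be to set d['image'] from the description whenever no enclosure tag supplied one.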