Extracting img from the description tag of an XML file using BeautifulSoup

Time: 2013-10-30 06:38:50

Tags: python xml beautifulsoup

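I am parsing an XML file and want to get the image inside the description tag. I am using urllib and BeautifulSoup. I can get images that sit in their own tags, but I can't get the image inside the description tag, where the markup is stored in encoded (HTML-escaped) form.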

XML code

<item>
         <title>Kidnapped NDC member and political activist tells his story</title>
         <link>http://www.yementimes.com/en/1724/news/3065</link>
         <description>&lt;img src="http://www.yementimes.com/images/thumbnails/cms-thumb-000003081.jpg" border="0" align="left" hspace="5" /&gt;
‘I kept telling them that they would never break me and that the change we demanded in 2011 would come whether they wanted it or not’
&lt;br clear="all"&gt;</description>

views.py

for q in b.findAll('item'):
            d={}
            d['desc']=strip_tags(q.description.string).strip('&nbsp')
            if q.guid:
                d['link']=q.guid.string
            else:   
                d['link']=strip_tags(q.comments)
            d['title']=q.title.string
            for r in q.findAll('enclosure'):
                d['image']=r['url']
            arr.append(d)

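Can anyone give me an idea?
The code above is what I use to parse images that sit in their own tags. When the image is inside the description, I have tried to get it but could not.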

1 answer:

Answer 0: (score: 0)

You can take the text content of <description>, build a new BeautifulSoup object from it, and look up the src attribute of the first <img> element:

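from bs4 import BeautifulSoup
import sys
import html

# HTMLParser.unescape() was removed in Python 3.9; html.unescape() replaces it.
with open(sys.argv[1], 'r') as f:
    soup = BeautifulSoup(f, 'html.parser')

for i in soup.find_all('item'):
    # Re-parse the description text as HTML so <img> becomes a real tag.
    d = BeautifulSoup(html.unescape(i.description.string), 'html.parser')
    print(d.img['src'])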

Run it like this:

python3 script.py xmlfile

Output:

http://www.yementimes.com/images/thumbnails/cms-thumb-000003081.jpg
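
The same two-stage idea (parse the XML first, then parse the description text as HTML) can also be sketched with the standard library alone. This is an alternative illustration, not the answerer's method. The names ImgSrcCollector, first_img_src and SAMPLE are made up for this example, and the sample wraps a trimmed copy of the <item> from the question in an assumed <channel> root so that it is well-formed XML.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Hypothetical helper: collects the src of every <img> tag it sees."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <img ... /> are also routed here by default.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def first_img_src(description_text):
    """Return the first <img> src found in the description text, or None."""
    collector = ImgSrcCollector()
    collector.feed(description_text or "")
    collector.close()
    return collector.sources[0] if collector.sources else None


# Trimmed sample based on the question's <item>; the <channel> root is assumed.
SAMPLE = """<channel><item>
<title>Kidnapped NDC member and political activist tells his story</title>
<link>http://www.yementimes.com/en/1724/news/3065</link>
<description>&lt;img src="http://www.yementimes.com/images/thumbnails/cms-thumb-000003081.jpg" border="0" align="left" hspace="5" /&gt;
&lt;br clear="all"&gt;</description>
</item></channel>"""

root = ET.fromstring(SAMPLE)
for item in root.iter("item"):
    # The XML parser has already decoded &lt;img ...&gt; back into <img ...>.
    print(first_img_src(item.findtext("description")))
```

Because first_img_src returns None when the description holds no <img>, it could serve as a fallback in the question's loop for items that have no enclosure tag.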