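How can I parse the markup below with Python's BeautifulSoup? For each image I need the corresponding width and height attributes, if they exist.
The markup below "means there are 3 images on this page, the first image is 300x300, the middle one has unspecified dimensions, and the last one is 1000px tall" (as described here).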
<meta property="og:image" content="http://example.com/rock.jpg" />
<meta property="og:image:width" content="300" />
<meta property="og:image:height" content="300" />
<meta property="og:image" content="http://example.com/rock2.jpg" />
<meta property="og:image" content="http://example.com/rock3.jpg" />
<meta property="og:image:height" content="1000" />
So far I have the following code, but it only returns the first set of dimensions:
images = []
img_list = soup.findAll('meta', {"property":'og:image'})
for og_image in img_list:
    if not og_image.get('content'):
        continue
    image = {'url': og_image['content']}
    width = self.soup.find('meta', {"property":'og:image:width'})
    if width:
        image['width'] = width['content']
    height = self.soup.find('meta', {"property":'og:image:height'})
    if width:
        image['height'] = height['content']
    images.append(image)
Thanks!
Answer 0 (score: 2)
This isn't BeautifulSoup, but a pyparsing approach is quick to put together:
html = """
<meta property="og:image" content="http://example.com/rock.jpg" />
<meta property="og:image:width" content="300" />
<meta property="og:image:height" content="300" />
<meta property="og:image" content="http://example.com/rock2.jpg" />
<meta property="og:image" content="http://example.com/rock3.jpg" />
<meta property="og:image:height" content="1000" />
"""
from pyparsing import makeHTMLTags, withAttribute, Optional, Group
# use makeHTMLTags to define tag expressions (allows attributes, whitespace,
# closing '/', etc., and sets up results names for matched attributes so they
# are easy to get at later)
meta,metaEnd = makeHTMLTags("meta")
# define a copy of the opening tag, filtering on the specific attribute to select for
img_meta = meta.copy().setParseAction(withAttribute(('property','og:image')))
wid_meta = meta.copy().setParseAction(withAttribute(('property','og:image:width')))
hgt_meta = meta.copy().setParseAction(withAttribute(('property','og:image:height')))
# now define the overall expression to look for, and assign names for subexpressions
# for width and height
img_ref = img_meta + Optional(Group(wid_meta)("width")) + Optional(Group(hgt_meta)("height"))
# use searchString to scan through the given text looking for matches
for img in img_ref.searchString(html):
    print "IMAGE:", img.content
    if img.height:
        print "H:", img.height.content
    if img.width:
        print "W:", img.width.content
    print
Prints:
IMAGE: http://example.com/rock.jpg
H: 300
W: 300
IMAGE: http://example.com/rock2.jpg
IMAGE: http://example.com/rock3.jpg
H: 1000
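Answer 1 (score: 2)
I wanted something quick that uses the BeautifulSoup tree structure. Here is the solution I settled on, in case anyone is looking for something similar: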
from BeautifulSoup import BeautifulSoup, Tag
soup = BeautifulSoup(html)
images = []
image = {}
img_list = soup.findAll('meta', {"property":'og:image'})
for og_image in img_list:
    if not og_image.get('content'):
        continue
    image = {'url': og_image['content']}
    next = og_image.nextSibling.nextSibling # calling once returns end of line char '\n'
    if next and isinstance(next, Tag) and next.get('property', '').startswith('og:image:'):
        dimension = next['content']
        prop = next.get('property').rsplit(':')[-1]
        image[prop] = dimension
        # only look for a second dimension directly after the first one
        next = next.nextSibling.nextSibling
        if next and isinstance(next, Tag) and next.get('property', '').startswith('og:image:'):
            dimension = next['content']
            prop = next.get('property').rsplit(':')[-1]
            image[prop] = dimension
    images.append(image)
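The same sibling walk can be written without hard-coding two look-ahead steps. The following is only a rough sketch: it assumes Python 3 and the bs4 package (rather than the BeautifulSoup 3 import used above), and the function name collect_og_images is made up for illustration.

```python
from bs4 import BeautifulSoup

html = """
<meta property="og:image" content="http://example.com/rock.jpg" />
<meta property="og:image:width" content="300" />
<meta property="og:image:height" content="300" />
<meta property="og:image" content="http://example.com/rock2.jpg" />
<meta property="og:image" content="http://example.com/rock3.jpg" />
<meta property="og:image:height" content="1000" />
"""


def collect_og_images(markup):
    """Return one dict per og:image tag, with any width/height that follows it."""
    soup = BeautifulSoup(markup, "html.parser")
    images = []
    for og_image in soup.find_all("meta", attrs={"property": "og:image"}):
        url = og_image.get("content")
        if not url:
            continue
        image = {"url": url}
        # find_next_siblings("meta") skips the whitespace text nodes between tags
        for tag in og_image.find_next_siblings("meta"):
            prop = tag.get("property", "")
            if not prop.startswith("og:image:"):
                break  # reached the next image (or an unrelated meta tag)
            # "og:image:width" -> "width", "og:image:height" -> "height"
            image[prop.rsplit(":", 1)[-1]] = tag.get("content")
        images.append(image)
    return images


images = collect_og_images(html)
for image in images:
    print(image)
```

For the sample markup this should give three dicts: the first with url, width and height, the second with only url, and the third with url and height. Stopping at the first tag that is not an og:image:* property is what keeps one image's dimensions from leaking into the next.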
Answer 2 (score: 0)
Yours is not a parsing problem but a list-processing problem. You want to "group" a list like this:
[u'http://example.com/rock.jpg', u'300', u'300', u'http://example.com/rock2.jpg', u'http://example.com/rock3.jpg', u'1000']
into something like this:
[[u'http://example.com/rock.jpg', u'300', u'300'], [u'http://example.com/rock2.jpg'], [u'http://example.com/rock3.jpg', u'1000']]
Here is my solution:
import BeautifulSoup as BS
content = '''<meta property="og:image" content="http://example.com/rock.jpg" />
<meta property="og:image:width" content="300" />
<meta property="og:image:height" content="300" />
<meta property="og:image" content="http://example.com/rock2.jpg" />
<meta property="og:image" content="http://example.com/rock3.jpg" />
<meta property="og:image:height" content="1000" />'''
soup = BS.BeautifulSoup(content)
data = [m['content'] for m in soup.findAll('meta')]
# Grouping
images = []
current_image = None
for d in data:
    if d.startswith('http'):
        if current_image:
            images.append(current_image)
        current_image = [d]
    else:
        if current_image:
            current_image.append(d)
        else:
            raise Exception('error')
images.append(current_image)
print data
print images
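One limitation of grouping on the bare content values is that the third group ends up as a URL plus '1000' with nothing to say that the number is a height. A possible variation, sketched here in Python 3, is to keep the property name next to each value and group on that instead of on the 'http' prefix. The function name group_og_images and the (property, content) pair format are assumptions for illustration; with BeautifulSoup the pairs could be built with something like [(m.get('property'), m.get('content')) for m in soup.findAll('meta')].

```python
def group_og_images(pairs):
    """Group (property, content) pairs into one dict per og:image entry."""
    images = []
    for prop, content in pairs:
        prop = prop or ""  # meta tags without a property attribute are ignored
        if prop == "og:image":
            # a plain og:image starts a new group
            images.append({"url": content})
        elif prop.startswith("og:image:") and images:
            # og:image:width / og:image:height attach to the most recent image
            images[-1][prop.rsplit(":", 1)[-1]] = content
    return images


# Pairs in document order, as they would come out of the sample markup
pairs = [
    ("og:image", "http://example.com/rock.jpg"),
    ("og:image:width", "300"),
    ("og:image:height", "300"),
    ("og:image", "http://example.com/rock2.jpg"),
    ("og:image", "http://example.com/rock3.jpg"),
    ("og:image:height", "1000"),
]

for image in group_og_images(pairs):
    print(image)
```

With these pairs the last image comes out as a dict with url and height keys, so the 1000 stays labeled as a height. Dimension tags that appear before any og:image are silently skipped here rather than raising an exception, which is a design choice and easy to change.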