Question

for imgsrc in Soup.findAll('img', {'class': 'sizedProdImage'}):
    if imgsrc:
        imgsrc = imgsrc
    else:
        imgsrc = "ERROR"

patImgSrc = re.compile('src="(.*)".*/>')
findPatImgSrc = re.findall(patImgSrc, imgsrc)

print findPatImgSrc

'''
<img height="72" name="proimg" id="image" class="sizedProdImage" src="http://imagelocation" />

This is what I am trying to extract from, and I am getting:

findimgsrcPat = re.findall(imgsrcPat, imgsrc)
File "C:\Python27\lib\re.py", line 177, in findall
    return _compile(pattern, flags).findall(string)
TypeError: expected string or buffer

'''

Answer 1

There is a simpler solution:

 soup.find('img')['src']

Answer 2

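You are passing a BeautifulSoup node to re.findall, which only accepts a string (or buffer) as the text to search. You have to convert the node to a string first. Try: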

findPatImgSrc = re.findall(patImgSrc, str(imgsrc))
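
To make the cause of the TypeError concrete, here is a minimal sketch that needs only the standard library. FakeTag is a hypothetical stand-in for a BeautifulSoup node (it is not part of any library); the only thing assumed about it is that it is not a string but str() returns its markup.

```python
import re


class FakeTag(object):
    """Hypothetical stand-in for a BeautifulSoup node: not a string itself,
    but str() returns its markup."""

    def __init__(self, markup):
        self.markup = markup

    def __str__(self):
        return self.markup


pat_img_src = re.compile('src="(.*)".*/>')
node = FakeTag('<img height="72" name="proimg" id="image" '
               'class="sizedProdImage" src="http://imagelocation" />')

# Passing the object itself fails, because re can only search strings.
try:
    re.findall(pat_img_src, node)
    raised = False
except TypeError:
    raised = True

# Converting to a string first lets the search run.
found = re.findall(pat_img_src, str(node))
```

Here raised ends up True and found holds the single URL from the tag.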

Better yet, use the tools BeautifulSoup provides:

[x['src'] for x in soup.findAll('img', {'class': 'sizedProdImage'})]

This gives you a list of the src attributes of all img tags with the class 'sizedProdImage'.
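
For readers without BeautifulSoup at hand, roughly the same list can be sketched with the standard library's HTML parser. This is a swapped-in technique, not what the answer above uses: ImgSrcCollector and the sample markup are hypothetical names made up for illustration. Unlike x['src'], which raises KeyError on a tag with no src attribute, this sketch simply skips such tags.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2.7


class ImgSrcCollector(HTMLParser):
    """Hypothetical helper: collects the src of every <img> with a given class."""

    def __init__(self, wanted_class):
        HTMLParser.__init__(self)
        self.wanted_class = wanted_class
        self.sources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get('class') or '').split()
        # Keep only img tags that carry the class and actually have a src.
        if tag == 'img' and self.wanted_class in classes and attrs.get('src'):
            self.sources.append(attrs['src'])


# Made-up sample markup: one match, one other class, one tag without src.
page = '''
<img class="sizedProdImage" src="http://imagelocation" />
<img class="thumbnail" src="http://other" />
<img class="sizedProdImage" />
'''

collector = ImgSrcCollector('sizedProdImage')
collector.feed(page)
sources = collector.sources
```

Only the first tag in the sample contributes to sources; the second has a different class and the third has no src.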

Answer 3

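You are creating a compiled re object and then passing it to the module-level re.findall, which is normally called with a pattern string as its first argument: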

patImgSrc = re.compile('src="(.*)".*/>')
findPatImgSrc = re.findall(patImgSrc, imgsrc)

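Instead, use the .findall method of the patImgSrc object you just created, and pass it a string (the BeautifulSoup node still has to be converted with str()):

patImgSrc = re.compile('src="(.*)".*/>')
findPatImgSrc = patImgSrc.findall(str(imgsrc))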
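
As a quick check, the sketch below runs both call styles on a plain string. The sample markup is made up, with src deliberately not the last attribute. Both styles accept the compiled pattern and return the same list, so the choice between them is mostly a matter of style. The sketch also shows that the greedy (.*) in the question's pattern can capture too much; the stricter pattern is an assumed alternative, not something from the thread.

```python
import re

# Made-up markup where src is followed by another attribute.
markup = '<img class="sizedProdImage" src="http://imagelocation" height="72" />'

greedy = re.compile('src="(.*)".*/>')

# Both call styles accept the compiled pattern and give the same result.
via_module = re.findall(greedy, markup)
via_method = greedy.findall(markup)

# Assumed alternative: stop at the first closing quote instead of the last.
strict = re.compile(r'src="([^"]*)"')
strict_result = strict.findall(markup)
```

With this input the greedy pattern captures everything up to the last quote, including the height attribute, while the stricter one returns only the URL.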

Answer 4

In my example, htmlText contains the img tags, but the same approach works for a URL too. See my answer here

# BeautifulSoup 3 import, as used on Python 2.7
from BeautifulSoup import BeautifulSoup as BSHTML
htmlText = """<img src="https://src1.com/" /> <img src="https://src2.com/" /> """
soup = BSHTML(htmlText)
images = soup.findAll('img')
for image in images:
    # Print the src attribute of each img tag
    print image['src']

Python 2.7 Beautiful Soup Img Src Extract

4 Answers: