Question

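I'm trying to scrape some images from the Redfin website, but the findAll() method does not seem to find all of the image URLs whose parent element has the class ImageCard.

Here is the code: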

from bs4 import BeautifulSoup
import urllib2  # Python 2; on Python 3 the equivalent module is urllib.request

def make_soup(url):
    # Send a browser-like User-Agent header with the request.
    headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
    req = urllib2.Request(url, headers=headers)
    thepage = urllib2.urlopen(req).read()
    soupdata = BeautifulSoup(thepage, "html.parser")
    return soupdata

soup = make_soup("https://www.redfin.com/CA/San-Diego/5747-Adobe-Falls-Rd-92120/unit-A/home/5437025")

imgcards = soup.findAll('div', {'class': 'ImageCard'})
for imgcard in imgcards:
    # findAll returns a list of tags, so read 'src' from each tag, not from the list.
    for img in imgcard.findAll('img'):
        print(img['src'])

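I want to download all the images in the slideshow on this page.

The element tree of the page is: (image: element tree of the webpage)

I can only find the div for the first image of the slideshow. I hope someone can figure this out. Thanks!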

Answer 1

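The HTML does not contain links to those additional photos, which is why you can't find them. They are created with JavaScript, and your program does not execute JavaScript.

However, if you look closely, you will find this: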

<meta content="http://media.cdn-redfin.com/photo/48/bigphoto/983/160048983_0.jpg" name="twitter:image:src">

This is an alternate URL for the first photo.
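
A minimal sketch of how that meta tag could be read with BeautifulSoup. The helper name first_photo_url and the inline sample HTML are assumptions made for illustration; in practice the HTML would be the page fetched by make_soup, and the tag may not be present on every page.

```python
from bs4 import BeautifulSoup

def first_photo_url(html):
    """Return the content of the twitter:image:src meta tag, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    # 'name' is also find()'s first positional parameter, so pass it via attrs.
    tag = soup.find("meta", attrs={"name": "twitter:image:src"})
    return tag.get("content") if tag is not None else None

# Hypothetical sample input: only the meta tag quoted above.
sample = ('<head><meta content="http://media.cdn-redfin.com/photo/48/bigphoto/983/160048983_0.jpg" '
          'name="twitter:image:src"></head>')
print(first_photo_url(sample))
```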

The URL of the second image is:

https://ssl.cdn-redfin.com/photo/48/bigphoto/983/160048983_1_0.jpg

The URL of the third:

https://ssl.cdn-redfin.com/photo/48/bigphoto/983/160048983_2_0.jpg

You can use this to get what you want: the URLs of the additional images can be guessed from the first one.
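
The guessing step could look roughly like the sketch below. It assumes the naming pattern inferred from the three URLs above (photo N, for N of 1 or more, ends in _N_0.jpg and is served from ssl.cdn-redfin.com); that pattern is not documented anywhere and may not hold for other listings. The function name guess_photo_urls and the count parameter are hypothetical. Since the page does not say how many photos exist, you would have to request each candidate and stop at the first one that fails.

```python
import re

def guess_photo_urls(first_url, count):
    """Build candidate URLs for photos 0..count-1 from the first photo's URL."""
    # Capture the path between '/photo/' and the trailing '_0.jpg'.
    match = re.search(r"/photo/(.+)_0\.jpg$", first_url)
    if match is None:
        raise ValueError("unexpected photo URL format: %s" % first_url)
    # Assumption: additional photos live on the ssl.cdn-redfin.com host.
    base = "https://ssl.cdn-redfin.com/photo/" + match.group(1)
    # Keep the first URL as given; later photos follow the assumed _N_0.jpg pattern.
    return [first_url] + [base + "_%d_0.jpg" % i for i in range(1, count)]

first = "http://media.cdn-redfin.com/photo/48/bigphoto/983/160048983_0.jpg"
for url in guess_photo_urls(first, 3):
    print(url)
```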
