Using BeautifulSoup with Python

Posted: 2018-01-12 11:26:30

Tags: python beautifulsoup pinterest

I am trying to extract Pinterest data such as pin titles, image attributes like alt and src, comments/descriptions, and the creator. Since my API application has not yet been approved, I am attempting web scraping with BeautifulSoup and Python. The limitation I see is that, no matter which query keyword I use, it only retrieves 16 alt/src pairs. How can I overcome this limit and extract at least 100 items? A sample code snippet is below. I look forward to hearing from you. Thanks a lot!

import requests
from bs4 import BeautifulSoup
import pandas as pd

var = "analytics"
URL = "https://in.pinterest.com/search/pins/?q=" + var

r = requests.get(URL)

soup = BeautifulSoup(r.content, 'html5lib')
alt = []
src = []

# Collect the alt text and source URL of every image in the static HTML;
# only the images present in the initial response are found this way
for link in soup.find_all('img'):
    alt.append(link.get('alt'))
    src.append(link.get('src'))
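Since pandas is already imported in the snippet, the two collected lists can be combined into a DataFrame for inspection or export. A minimal sketch (the alt/src values here are made-up stand-ins for the scraped lists):

```python
import pandas as pd

# Hypothetical scraped values standing in for the alt/src lists above
alt = ["analytics dashboard", "data chart"]
src = ["https://i.pinimg.com/a.jpg", "https://i.pinimg.com/b.jpg"]

# One row per image; alt and src stay aligned because both lists
# are appended to in the same loop iteration
df = pd.DataFrame({"alt": alt, "src": src})
print(df.shape)  # (2, 2)
```

From here, `df.to_csv(...)` would persist the results for later analysis.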

1 Answer:

Answer 0 (score: 0)

Many pages, including Pinterest, save bandwidth and improve the user experience by not loading content until it is needed (lazy loading / infinite scroll). A plain requests call therefore only sees the first batch of images.

To work around this, we combine selenium with BeautifulSoup:

from selenium import webdriver
from bs4 import BeautifulSoup
import time

# Any infinite-scroll URL
var = "analytics"
url = "https://in.pinterest.com/search/pins/?q=" + var
ScrollNumber = 4  # How many times to scroll down
sleepTimer = 1    # Wait 1 second after each scroll for content to load

# Suppress a harmless Bluetooth/DevTools logging error Chrome prints on Windows
options = webdriver.ChromeOptions()
options.add_experimental_option("excludeSwitches", ["enable-logging"])

driver = webdriver.Chrome(options=options)  # point Selenium at chromedriver.exe if it is not on PATH
driver.get(url)

for _ in range(ScrollNumber):
    # Jump to the bottom of the page to trigger the next batch of pins
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
    print("scrolling")
    time.sleep(sleepTimer)

soup = BeautifulSoup(driver.page_source, 'html.parser')

# The page source now contains every image loaded during scrolling
for link in soup.find_all('img'):
    print(link.get('src'))
    # print(link.get('alt'))

Provided you place chromedriver.exe in your script folder, the script above gives you a simple starting point.
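One caveat: after repeated scrolls the same image can appear more than once in the DOM, and lazily loaded images may have no src yet. Deduplicating the collected values while preserving order is a sensible final step. A plain-Python sketch (the URLs below are made up for illustration):

```python
# Hypothetical src values collected from repeated scrolls
srcs = [
    "https://i.pinimg.com/236x/a.jpg",
    "https://i.pinimg.com/236x/b.jpg",
    "https://i.pinimg.com/236x/a.jpg",  # duplicate after re-rendering
    None,                               # lazy image whose src was never set
]

# dict.fromkeys keeps first-seen order while dropping duplicates;
# the generator expression drops images with a missing src
unique = list(dict.fromkeys(s for s in srcs if s))
print(len(unique))  # 2
```

This gives a clean list of distinct image URLs regardless of how many scroll passes were made.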