Getting the href under a div tag in Python

Time: 2020-09-19 08:57:48

Tags: python selenium selenium-webdriver selenium-chromedriver screen-scraping

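I am using Python to build a tool that downloads all the photos and videos a user has posted on Instagram. What I need to do now is extract the links to all of the posts, so that I can loop over them and download each one.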

(screenshot from the original post, not preserved)

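This is the href I want to extract. I tried all of the solutions I could find on Stack Overflow, but none of them worked, which is why I am asking this question. Here is my code for reference: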

import urllib.request as reqq
from selenium import webdriver

url = input("Enter the link:")

browser = webdriver.Chrome("D:\\Python_Files\\Programs\\chromedriver.exe")

browser.get(url)

# URLs of all posts are extracted here. This is where I need help.

for x in range(len(extracted_urls)):
    img_url = ""
    vid_url = ""
    
    try:
        
        vid_url = browser.find_element_by_class_name('_5wCQW').find_element_by_tag_name('video').get_attribute('src')
        reqq.urlretrieve(vid_url,f"D:\\instavid{x}.mp4")    
        
    except: 
        
        img_url = browser.find_element_by_class_name('KL4Bh').find_element_by_tag_name('img').get_attribute('src')
        reqq.urlretrieve(img_url,f"D:\\instaimg{x}.jpg")
    
browser.close()

3 Answers:

Answer 0 (score: 2)

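You are using class names to identify the elements, but these are programmatically generated and will not work reliably. To scrape the links, you can use a CSS selector like the one below: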

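# driver is the webdriver instance (named browser in the question).
# Newer Selenium 4 releases drop find_elements_by_css_selector; the equivalent
# there is driver.find_elements(By.CSS_SELECTOR, 'article > div a').
links = driver.find_elements_by_css_selector('article > div a')
for element in links:
    print(element.get_attribute('href'))  # prints the URL of every matched link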
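
Because the selector above matches every link inside the article, it can help to keep only post permalinks and drop duplicates before looping over them. The following is a minimal sketch, assuming post links contain a `/p/` path segment (as in the `instagram.com/p/...` URLs seen elsewhere on this page). The helper name `collect_post_links` and the `FakeElement` stand-in are hypothetical; any object with a `get_attribute` method, such as a Selenium element, should work in its place.

```python
def collect_post_links(elements):
    """Return unique hrefs that look like post permalinks, in page order."""
    seen = set()
    links = []
    for element in elements:
        href = element.get_attribute('href')
        # Skip anchors without an href and links that are not posts.
        if not href or '/p/' not in href:
            continue
        if href not in seen:
            seen.add(href)
            links.append(href)
    return links


class FakeElement:
    """Hypothetical stand-in for a browser element, used only for this demo."""

    def __init__(self, href):
        self._href = href

    def get_attribute(self, name):
        return self._href if name == 'href' else None


demo = [
    FakeElement('https://www.instagram.com/p/CE2D7zcAFrq/'),
    FakeElement('https://www.instagram.com/explore/'),
    FakeElement('https://www.instagram.com/p/CE2D7zcAFrq/'),
    FakeElement(None),
]
post_links = collect_post_links(demo)
print(post_links)
```

In a real run, the list returned by the selector lookup above would be passed in place of `demo`.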

Answer 1 (score: 1)

Try BeautifulSoup. It makes parsing HTML and XML easy.

from bs4 import BeautifulSoup

data = '<div><div><a href="/p/CFShhjj"></a></div></div>'
soup = BeautifulSoup(data, 'html.parser')
for tag in soup.find_all('a'):  # visit only the <a> tags
    print(tag['href'])


Output: /p/CFShhjj
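
The href printed above is relative, so it has to be joined with the site's base URL before it can be opened in a browser or downloaded. Here is a small sketch using the standard library's `urllib.parse.urljoin`; the base URL and the `to_absolute` helper name are assumptions made for illustration.

```python
from urllib.parse import urljoin

BASE_URL = 'https://www.instagram.com/'  # assumed base URL of the scraped page


def to_absolute(href, base=BASE_URL):
    """Resolve a possibly relative href against the page's base URL."""
    return urljoin(base, href)


absolute = to_absolute('/p/CFShhjj')
print(absolute)
```

An href that is already absolute passes through `urljoin` unchanged, so the helper can be applied to every link without checking first.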

Answer 2 (score: 1)

The data you see on the page is generated programmatically and stored as JSON inside the page. Use this example to extract the media data:

import re
import json
import requests


url = 'https://www.instagram.com/cristiano/'
html_doc = requests.get(url).text
data = json.loads(re.search(r'window\._sharedData = ({.*?});', html_doc).group(1))

def find_media(data):
    if isinstance(data, dict):
        for k, v in data.items():
            if k == '__typename' and v in ('GraphImage', 'GraphVideo'): 
                yield data
            else:
                yield from find_media(v)
    elif isinstance(data, list):
        for v in data:
            yield from find_media(v)

for media in find_media(data):
    print('http://instagram.com/p/{}/'.format(media['shortcode']))
    if media['__typename'] == 'GraphImage':
        print(media['display_url']) 
    else:
        print(media['video_url']) 

Prints:

...

http://instagram.com/p/CE2D7zcAFrq/
https://instagram.fbts6-1.fna.fbcdn.net/v/t51.2885-15/e35/s1080x1080/118883515_338753734147784_3257042213665207515_n.jpg?_nc_ht=instagram.fbts6-1.fna.fbcdn.net&_nc_cat=108&_nc_ohc=Duxo0dZ3q8oAX8ps1u6&_nc_tp=15&oh=e52995d9569e7dde5c348a5eb1c4a886&oe=5F8F1B14
http://instagram.com/p/CE2D7zbAMrJ/
https://instagram.fbts6-1.fna.fbcdn.net/v/t51.2885-15/e35/s1080x1080/119056544_318190815934349_3868576271600213484_n.jpg?_nc_ht=instagram.fbts6-1.fna.fbcdn.net&_nc_cat=105&_nc_ohc=Sfy0ykNdpxsAX9kOZqy&_nc_tp=15&oh=b48009b98e0a4b6483f79902d3253d12&oe=5F8F11FB
http://instagram.com/p/CEy7I6yAm9i/
https://instagram.fbts6-1.fna.fbcdn.net/v/t51.2885-15/e35/s1080x1080/118877779_2707314466207997_7737960511758007253_n.jpg?_nc_ht=instagram.fbts6-1.fna.fbcdn.net&_nc_cat=1&_nc_ohc=prF2yBSpK34AX8TSiuX&_nc_tp=15&oh=bdfc5b9d1914bdd3470209a64a5e155b&oe=5F8DEBFF
http://instagram.com/p/CEy7I6yg2Op/
https://instagram.fbts6-1.fna.fbcdn.net/v/t51.2885-15/e35/s1080x1080/118784068_748127092633181_2341530667249985288_n.jpg?_nc_ht=instagram.fbts6-1.fna.fbcdn.net&_nc_cat=105&_nc_ohc=_ynvOYOBxY4AX_B22yF&_nc_tp=15&oh=e5e9d716421cd3ea0edb71adeebe8ebe&oe=5F8F3798
http://instagram.com/p/CEy7I6zA6nU/
https://instagram.fbts6-1.fna.fbcdn.net/v/t51.2885-15/e35/s1080x1080/118555638_778542006213962_8711737455993781057_n.jpg?_nc_ht=instagram.fbts6-1.fna.fbcdn.net&_nc_cat=102&_nc_ohc=MVPAjHvN3QkAX9uawMn&_nc_tp=15&oh=cefd5a6162af11327ab9d1d4bf94df7a&oe=5F8FCF5B
http://instagram.com/p/CEy7I63Aq9f/
https://instagram.fbts6-1.fna.fbcdn.net/v/t51.2885-15/e35/s1080x1080/118782135_760435874772485_2807641115290436245_n.jpg?_nc_ht=instagram.fbts6-1.fna.fbcdn.net&_nc_cat=105&_nc_ohc=SUZnVsn1EU4AX8MUj30&_nc_tp=15&oh=7937a7c235a54b271a869596c837fae6&oe=5F907287
http://instagram.com/p/CEy7I64AvaZ/
https://instagram.fbts6-1.fna.fbcdn.net/v/t51.2885-15/e35/s1080x1080/118651624_163760538669975_655651222517528584_n.jpg?_nc_ht=instagram.fbts6-1.fna.fbcdn.net&_nc_cat=103&_nc_ohc=8mj5Ysn75pMAX9dpSDA&_nc_tp=15&oh=6c33025592ee8f4569ad5d016e69785b&oe=5F8FF67B
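
To see how the recursive `find_media` generator from this answer walks the page data without making a network request, it can be run on a small hand-written structure. The nested dictionary below is invented for illustration and is much simpler than the real page JSON; only the `__typename`, `shortcode`, `display_url`, and `video_url` keys come from the answer's code, and the example.com URLs are placeholders.

```python
def find_media(data):
    """Recursively yield every dict whose __typename marks an image or video."""
    if isinstance(data, dict):
        for k, v in data.items():
            if k == '__typename' and v in ('GraphImage', 'GraphVideo'):
                yield data
            else:
                yield from find_media(v)
    elif isinstance(data, list):
        for v in data:
            yield from find_media(v)


# Made-up sample data; the real structure is assumed to be far more deeply nested.
sample = {
    'user': {
        'posts': [
            {'node': {'__typename': 'GraphImage', 'shortcode': 'AAA111',
                      'display_url': 'https://example.com/a.jpg'}},
            {'node': {'__typename': 'GraphVideo', 'shortcode': 'BBB222',
                      'video_url': 'https://example.com/b.mp4'}},
            {'node': {'__typename': 'GraphSidecar', 'shortcode': 'CCC333'}},
        ]
    }
}

found = [(m['shortcode'], m['__typename']) for m in find_media(sample)]
print(found)
```

Dictionaries with any other type name are not yielded themselves, but their values are still searched for nested media.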