I'm creating a simple program with Selenium that goes to Flickr.com, searches for a user-entered term, and then prints out the URLs of all the matching images.
I'm stuck on the last part: getting just the image URLs. I've been using class_=
searches to locate the section of HTML that the URL sits in. After searching for 'apples':
<div class="view photo-list-photo-view requiredToShowOnServer awake"
data-view-signature="photo-list-photo-view__engagementModelName_photo-lite-models__excludePeople_false__id_6246270647__interactionViewName_photo-list-photo-interaction-view__isOwner_false__layoutItem_1__measureAFT_true__model_1__modelParams_1__parentContainer_1__parentSignature_photolist-479__requiredToShowOnClient_true__requiredToShowOnServer_true__rowHeightMod_1__searchTerm_apples__searchType_1__showAdvanced_true__showSort_true__showTools_true__sortMenuItems_1__unifiedSubviewParams_1__viewType_jst"
style="transform: translate(823px, 970px); -webkit-transform: translate(823px, 970px); -ms-transform: translate(823px, 970px); width: 237px; height: 178px; background-image: url(//c3.staticflickr.com/7/6114/6246270647_edc7387cfc_m.jpg)">
<div class="interaction-view"></div>
All I want is each image's URL, like this:
c3.staticflickr.com/7/6114/6246270647_edc7387cfc_m.jpg
Since there is no a
or href
tag, I'm having a hard time filtering the URL out.
As a last resort I also tried some regex, like this:
print(soup.find_all(re.compile(r'^url\.jpg$')))
But that didn't work.
Here is my full code, thanks:
import os
import re
import urllib.request as urllib2
import bs4
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
os.makedirs('My_images', exist_ok=True)
browser = webdriver.Chrome()
browser.implicitly_wait(10)
print("Opening Flickr.com")
siteChoice = 'http://www.flickr.com'
browser.get(siteChoice)
print("Enter your search term: ")
term = input("> ")
searchField = browser.find_element_by_id('search-field')
searchField.send_keys(term)
searchField.submit()
url = siteChoice + '/search/?text=' + term
html = urllib2.urlopen(url)
soup = bs4.BeautifulSoup(html, "html.parser")
print(soup.find_all(class_='view photo-list-photo-view requiredToShowOnServer awake', style = re.compile('staticflickr')))
I changed my code to the following:
p = re.compile(r'url\(\/\/([^\)]+)\)')
test_str = str(soup)
all_urls = re.findall(p, test_str)
print('Exporting to file')
with open('flickr_urls.txt', 'w') as f:
    for i in all_urls:
        f.write("%s\n" % i)
print('Done')
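For reference, here is the extraction step in isolation; a minimal, self-contained demo of the same pattern (the HTML string below is a trimmed, hypothetical stand-in for the real page source):

```python
import re

# Same pattern as above: capture everything between "url(//" and ")".
p = re.compile(r'url\(//([^)]+)\)')

# Trimmed stand-in for str(soup); the real page contains one such
# inline style attribute per photo tile.
test_str = (
    'style="background-image: url(//c3.staticflickr.com/7/6114/a_m.jpg)" '
    'style="background-image: url(//c3.staticflickr.com/7/6114/b_m.jpg)"'
)

# findall returns the contents of the capture group for every match.
all_urls = re.findall(p, test_str)
print(all_urls)
# ['c3.staticflickr.com/7/6114/a_m.jpg', 'c3.staticflickr.com/7/6114/b_m.jpg']
```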
Answer 0 (score: 2)
Try this:
url\(\/\/([^\)]+)\)
import re
p = re.compile(r'url\(\/\/([^\)]+)\)')
test_str = u"<div class=\"view photo-list-photo-view requiredToShowOnServer awake\" \ndata-view-signature=\"photo-list-photo-view__engagementModelName_photo-lite-\nmodels__excludePeople_false__id_6246270647__interactionViewName_photo-list-\nphoto-interaction- view__isOwner_false__layoutItem_1__measureAFT_true__model_1__modelParams_1_ _parentContainer_1__parentSignature_photolist-\n479__requiredToShowOnClient_true__requiredToShowOnServer_true__rowHeightMod _1__searchTerm_apples__searchType_1__showAdvanced_true__showSort_true__show Tools_true__sortMenuItems_1__unifiedSubviewParams_1__viewType_jst\"\n style=\"transform: translate(823px, 970px); -webkit-transform: translate(823px, 970px); -ms-transform: translate(823px, 970px); width:\n 237px; height: 178px; background-image:\n url(//c3.staticflickr.com/7/6114/6246270647_edc7387cfc_m.jpg)\">\n<div class=\"interaction-view\"></div>"
m = re.search(p, test_str)
print(m.group(1))
Output:
c3.staticflickr.com/7/6114/6246270647_edc7387cfc_m.jpg
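A variant of the same idea that limits the regex to the style attribute of each matching div, instead of running it over the entire page source (a sketch; the class name comes from the question's HTML, and the html string is a trimmed stand-in for the real markup):

```python
import re
import bs4

# Trimmed stand-in for the Flickr markup shown in the question.
html = ('<div class="view photo-list-photo-view requiredToShowOnServer awake" '
        'style="background-image: url(//c3.staticflickr.com/7/6114/6246270647_edc7387cfc_m.jpg)">'
        '<div class="interaction-view"></div></div>')

soup = bs4.BeautifulSoup(html, "html.parser")
urls = []
# class_ with a single string matches any tag carrying that CSS class,
# even when the tag has several classes.
for div in soup.find_all("div", class_="photo-list-photo-view"):
    # The URL lives inside the inline style attribute, not in an href.
    m = re.search(r'url\(//([^)]+)\)', div.get("style", ""))
    if m:
        urls.append(m.group(1))
print(urls)  # ['c3.staticflickr.com/7/6114/6246270647_edc7387cfc_m.jpg']
```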
Answer 1 (score: 2)
To scrape all png/jpg links from the page with Selenium:
from selenium import webdriver
driver = webdriver.Firefox()
driver.get("https://www.flickr.com/")
links = driver.execute_script("return document.body.innerHTML.match(" \
"/https?:\/\/[a-z_\/0-9\-\#=&.\@]+\.(jpg|png)/gi)")
print(links)
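The same character-class pattern can also be run on the Python side against driver.page_source; a sketch with a small hypothetical HTML stand-in so it runs without a browser:

```python
import re

# Python translation of the JavaScript regex above; `html` stands in
# for driver.page_source here.
html = ('<img src="https://c3.staticflickr.com/7/6114/6246270647_edc7387cfc_m.jpg">'
        '<a href="http://example.com/images/logo.png">logo</a>')

# re.IGNORECASE plays the role of the /i flag in the JavaScript version.
pattern = re.compile(r'https?://[a-z_/0-9\-#=&.@]+\.(?:jpg|png)', re.IGNORECASE)
links = pattern.findall(html)
print(links)
```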