Hi, sorry for the vague title, but I'm practicing web scraping with Selenium. I have a list of links, `urls_to_scrape`, and for each URL in it I want to visit the page and extract certain elements. I've been able to extract each element from a single page, but now I'm stuck on how to do this for every URL in the list. See my code below.
urls_to_scrape # list containing urls I want to perform the code below for
# each url
results = []
articles = driver.find_elements_by_css_selector('#MainW article')
counter = 1
for article in articles:
    result = {}
    try:
        title = article.find_element_by_css_selector('a').text
    except:
        continue
    counter = counter + 1
    excerpt = article.find_element_by_css_selector('div > div > p').text
    author =
    article.find_element_by_css_selector('div > footer > address > a').text
    date = article.find_element_by_css_selector('div > footer > time').text
    link=
    article.find_element_by_css_selector('div>h2>a').get_attribute('href')
    result['title'] = title
    result['excerpt'] = excerpt
    result['author'] = author
    result['date'] = date
    result['link'] = link
    results.append(result)
Answer 0 (score: 0)
I think you have an indentation problem. Try this:
urls_to_scrape # list containing urls I want to perform the code below for
# each url
results = []
articles = driver.find_elements_by_css_selector('#MainW article')
counter = 1
for article in articles:
    result = {}
    try:
        title = article.find_element_by_css_selector('a').text
    except:
        continue
    counter = counter + 1
    excerpt = article.find_element_by_css_selector('div > div > p').text
    author = article.find_element_by_css_selector('div > footer > address > a').text
    date = article.find_element_by_css_selector('div > footer > time').text
    link = article.find_element_by_css_selector('div>h2>a').get_attribute('href')
    result['title'] = title
    result['excerpt'] = excerpt
    result['author'] = author
    result['date'] = date
    result['link'] = link
    results.append(result)
What is `driver`? You haven't shown the line that fetches the URL, and that line is also crucial for fetching multiple URLs.
Answer 1 (score: 0)
Make a function that performs the scraping (everything below `results = []`), e.g.
def scrape(url):
    ...
    ...
    return result
then
for url in urls_to_scrape:
    result = scrape(url)
    results.append(result)
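Putting both answers together, the per-URL flow would look like the sketch below. The Selenium parts are only indicated in comments: in the real script, `scrape(url)` would start with `driver.get(url)` and then run the `find_element` calls from the question. Here `scrape` is a hypothetical stub standing in for that logic, so only the loop structure itself is demonstrated.

```python
# Sketch of the pattern from the answer: call scrape() once per URL and
# collect the results into a single list.

def scrape(url):
    # Placeholder for the Selenium-based scrape(); the real version would
    # call driver.get(url) first and then build the result dict from the
    # page elements using the CSS selectors in the question.
    return {'link': url, 'title': 'placeholder title'}

urls_to_scrape = ['https://example.com/page1', 'https://example.com/page2']

results = []
for url in urls_to_scrape:
    result = scrape(url)
    results.append(result)

print(len(results))  # → 2
```

The key point is that everything page-specific lives inside `scrape`, so the outer loop stays a plain three-line iteration over the URL list.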