Intent:
1. Use Selenium to visit the main page of http://blogdobg.com.br/.
2. Identify the article links.
3. Feed each link into bs4 and pull out the text.

Problem: I can print all of the links, or move a single link into bs4 for parsing and printing. My attempt at reading each of the links just ends up with the same link repeated through many iterations.
I only started teaching myself two days ago, so any pointers would be appreciated.
    from selenium import webdriver
    from lxml import html
    import requests
    import re
    from bs4 import BeautifulSoup

    def read(html):
        html = browser.page_source
        soup = BeautifulSoup(html, "html.parser")
        for string in soup.article.stripped_strings:
            print(repr(string))
    path_to_chromedriver = '/Users/yakir/chromedriver'
    browser = webdriver.Chrome(executable_path=path_to_chromedriver)

    url = 'http://blogdobg.com.br/'
    browser.get(url)

    articles = browser.find_elements_by_xpath("""//*[contains(concat( " ", @class, " " ), concat( " ", "entry-title", " " ))]//a""")

    # get all the links
    for link in articles:
        link.get_attribute("href")

    # Attempt to print stripped strings from each link's landing page
    for link in articles:
        read(link.get_attribute("href"))

    ## method for getting one link to work all the way through (currently commented out)
    #article1 = articles[1].get_attribute("href")
    #browser.get(article1)
    #read(article1)
Answer 0 (score: 0):
First of all, your read() function takes an html parameter, but you redefine the html variable directly inside the function. That makes no sense: the argument is ignored either way, because BeautifulSoup(html, "html.parser") gets its value from html = browser.page_source, not from the html argument.
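One way to fix this (a minimal sketch, not the only option) is to have read() accept the URL, navigate to it, and only then read the page source. The global browser from the question is assumed to already exist:

    def read(url):
        # Navigate to the article first, so page_source
        # actually reflects the page we were asked to read.
        browser.get(url)
        soup = BeautifulSoup(browser.page_source, "html.parser")
        for string in soup.article.stripped_strings:
            print(repr(string))

Without the browser.get(url) call, every invocation parses whatever page the browser is currently on, which is why the same text repeats on every iteration.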
Another issue: you are not actually collecting the links with

    for link in articles:
        link.get_attribute("href")

since the return value is thrown away on each iteration. You should use a list and append the value on each iteration:
    link_list = []
    for link in articles:
        link_list.append(link.get_attribute("href"))
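As a side note, the same collection step can be written as a list comprehension, which is the more idiomatic Python form:

    link_list = [link.get_attribute("href") for link in articles]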
Then you can use your links as below:
    for link in link_list:
        r = requests.get(link)
        ...
        # do whatever you want to do with the response
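Putting both fixes together, here is a minimal end-to-end sketch. It assumes, as the original read() does, that each article page wraps its text in an <article> tag; the Selenium and bs4 calls are the same ones used above:

    from selenium import webdriver
    from bs4 import BeautifulSoup
    import requests

    browser = webdriver.Chrome(executable_path='/Users/yakir/chromedriver')
    browser.get('http://blogdobg.com.br/')

    articles = browser.find_elements_by_xpath(
        """//*[contains(concat( " ", @class, " " ), concat( " ", "entry-title", " " ))]//a""")

    # Collect the hrefs before navigating anywhere else;
    # otherwise the WebElement references go stale.
    link_list = [link.get_attribute("href") for link in articles]

    for link in link_list:
        # The article pages are static, so plain requests is enough;
        # there is no need to drive the browser for each one.
        soup = BeautifulSoup(requests.get(link).text, "html.parser")
        if soup.article:  # skip pages without an <article> tag
            for string in soup.article.stripped_strings:
                print(repr(string))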