Iterating over links from Selenium into bs4 and printing the stripped strings

Date: 2017-01-05 16:04:00

Tags: python python-3.x selenium selenium-webdriver bs4

Intent:

1. Use Selenium to visit the main page of http://blogdobg.com.br/.

2. Identify the article links.

3. Feed each link into bs4 and pull out the text.

Problem: I can either print all of the links, or move a single link into bs4 for parsing and printing. But my attempts to read each link in turn end up iterating over the same link many times.

I only started teaching myself two days ago, so any pointers would be appreciated.

from selenium import webdriver
from lxml import html
import requests
import re
from bs4 import BeautifulSoup

def read (html):
    html = browser.page_source
    soup = BeautifulSoup(html,"html.parser")
    for string in soup.article.stripped_strings:
            print(repr(string))

path_to_chromedriver = '/Users/yakir/chromedriver' 
browser = webdriver.Chrome(executable_path = path_to_chromedriver)

url = 'http://blogdobg.com.br/'
browser.get(url)

articles = browser.find_elements_by_xpath("""//*[contains(concat( " ", @class, " " ), concat( " ", "entry-title", " " ))]//a""")

#get all the links
for link in articles:
    link.get_attribute("href")

#Attempt to print stripped strings from each link's landing page
for link in articles:
        read(link.get_attribute("href"))

##method for getting one link to work all the way through (currently commented out)
#article1 = articles[1].get_attribute("href")
#browser.get(article1)
#read(article1)

1 Answer:

Answer 0 (score: 0):

First, your read() function takes an html parameter, but you immediately overwrite it inside the function with html = browser.page_source. That makes the argument pointless: BeautifulSoup(html, "html.parser") always parses whatever page the browser is currently on, never the value you passed in. Since you never call browser.get() on each article URL, every call to read() parses the same page.
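A minimal way to fix that is to take the URL as the parameter, navigate to it, and only then read browser.page_source. This is just a sketch of that idea, reusing the browser variable from your script and assuming (as your original code does) that each article page has an &lt;article&gt; element:

def read(url):
    # Load the article page with Selenium, then hand its HTML to BeautifulSoup
    browser.get(url)
    soup = BeautifulSoup(browser.page_source, "html.parser")
    for string in soup.article.stripped_strings:
        print(repr(string))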

Another problem: you never actually collect the links here

for link in articles:
    link.get_attribute("href")

You should use a list and append the value on each iteration:

link_list = []
for link in articles:
    link_list.append(link.get_attribute("href"))

Then you can work with those links, for example with requests:

for link in link_list:
    r = requests.get(link)
    ...
    # do whatever you want to do with response
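
Putting both fixes together, here is a rough end-to-end sketch (not tested against the site, using a simplified XPath with the same intent as yours, and assuming the articles still sit under an entry-title class and an &lt;article&gt; tag):

from selenium import webdriver
from bs4 import BeautifulSoup
import requests

path_to_chromedriver = '/Users/yakir/chromedriver'
browser = webdriver.Chrome(executable_path=path_to_chromedriver)

browser.get('http://blogdobg.com.br/')

# Collect the hrefs into a list instead of throwing them away
links = [a.get_attribute("href")
         for a in browser.find_elements_by_xpath('//*[contains(@class, "entry-title")]//a')]

# Fetch each article with requests and print its stripped strings
for link in links:
    soup = BeautifulSoup(requests.get(link).content, "html.parser")
    article = soup.find("article")
    if article:
        for string in article.stripped_strings:
            print(repr(string))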