Simulate clicking a link within a link - Selenium Python

Date: 2017-01-14 04:17:33

Tags: python selenium web-scraping

Python knowledge: beginner

I managed to create a script to scrape contact information. The process I followed, being a beginner, was to extract all the first-level links and copy them to a text file, which is read back in and used in link = browser.find_element_by_link_text(str(link_text)). Scraping the contact details itself has been confirmed to work (based on my separate runs). The problem is that after clicking the first link, the script never clicks the links inside it, so the contact information is never scraped.

What is wrong with my script? Please bear in mind that I am a beginner, so my script is a bit manual and verbose. Many thanks!!!

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

from selenium.common.exceptions import NoSuchElementException

import requests
from bs4 import BeautifulSoup
import urllib
import re
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
import csv, time, lxml

######################### open file list ####################################
testfile = open("category.txt") # this is where I saved the category
readfile = testfile.read()
readfilesplit = readfile.split("\n")
############################### end ###################################

################### open browser ###############################
browser = webdriver.Firefox()
browser.get('http://aucklandtradesmen.co.nz/')
####################### end ###################################

link_texts = readfilesplit
for link_text in link_texts:

        link = browser.find_element_by_link_text(str(link_text))
        WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, ".add-listing")))

        link.click() #click link
        time.sleep(5)

        print "-------------------------------------------------------------------------------------------------"
        print("Getting listings for '%s'" % link_text)

################# get list name #######################
        urlNoList = 'http://aucklandtradesmen.co.nz/home-mainmenu-1.html'
        r = requests.get(browser.current_url)

        if (urlNoList != browser.current_url):
            soup = BeautifulSoup(r.content, 'html.parser')

            g_data = soup.find_all("div", {"class":"listing-summary"})
            pageRange = soup.find_all("span", {"class":"xlistings"})

            pageR = [pageRange[0].text]
            pageMax = str(pageR)[-4:-2] # get max item for lists

            X = str(pageMax).replace('nd', '0')
            # print "Number of listings: ", X
            Y  = int(X) #convert string to int
            print "Number of listings: ", Y

            for item in g_data:
                try:
                    listingNames = item.contents[1].text
                    lstList = []
                    lstList[len(lstList):] = [listingNames]

                    replStr = re.sub(r"u'",  "'",str(lstList)) #strip u' char

                    replStr1 = re.sub(r"\s+'",  "'",str(replStr)) #strip space and '

                    replStr2 = re.sub(r"\sFeatured",  "",str(replStr1)) #strip Featured string
                    print "Cleaned string: ", replStr2

                    ################ SCRAPE INFO ################
################### This is where the code is not executing #######################
                    count = 0
                    while (count < Y):
                        for info in replStr2:
                            link2 = browser.find_element_by_link_text(str(info))
                            time.sleep(10)
                            link2.click()
                            WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, "#rating-msg")))
                            print "count", count
                            count+= 1
                            print("Contact info for: '%s'" % link_text)

                            r2 = requests.get(browser.current_url)

                            soup2 = BeautifulSoup(r2.content, 'html.parser')

                            g_data2 = soup.find_all("div", {"class":"fields"})

                            for item2 in g_data2:
                            # print item.contents[0]
                                print item2.contents[0].text
                                print item2.contents[1].text
                                print item2.contents[2].text
                                print item2.contents[3].text
                                print item2.contents[4].text
                                print item2.contents[5].text
                                print item2.contents[6].text
                                print item2.contents[7].text
                                print item2.contents[8].text

                    browser.back()
                    WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, ".add-listing")))
################### END ---- This is where the code is not executing END ---#######################
                    ############ END SCRAPE INFO ####################
                except NoSuchElementException:
                    browser.back()
                    WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.CLASS_NAME, "pagenav")))

        else:
            browser.back()
            WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.CLASS_NAME, "pagenav")))
            print "Number of listings: 0"

        browser.back()
        WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.CLASS_NAME, "pagenav")))

By the way, here are some of the results:

-------------------------------------------------------------------------------------------------
Getting listings for 'Plumbers'
Number of listings:  5
Cleaned string:  ['Hydroflame Plumbing & Gas Ltd']
Cleaned string:  ['Osborne Plumbing Ltd']
Cleaned string:  ['Plumbers Auckland Central']
Cleaned string:  ['Griffiths Plumbing']
Cleaned string:  ['Plumber Auckland']
-------------------------------------------------------------------------------------------------
Getting listings for 'Professional Services'
Number of listings:  2
Cleaned string:  ['North Shore Chiropractor']
Cleaned string:  ['Psychotherapy Werks - Rob Hunter']
-------------------------------------------------------------------------------------------------
Getting listings for 'Property Maintenance'
Number of listings:  7
Cleaned string:  ['Auckland Tree Services']
Cleaned string:  ['Bob the Tree Man']
Cleaned string:  ['Flawless House Washing & Drain Unblocking']
Cleaned string:  ['Yardiez']
Cleaned string:  ['Build Corp Apartments Albany']
Cleaned string:  ['Auckland Trellis']
Cleaned string:  ['Landscape Design']

2 Answers:

Answer 0 (score: 0):

What I would do is change the logic a bit. Here is the logic flow I would suggest you use. This eliminates writing the links out to a text file and speeds up the script.

1. Navigate to http://aucklandtradesmen.co.nz/
2. Grab all elements using CSS selector "#index a" and store the attribute "href" of each
   in an array of string (links to each category page)
3. Loop through the href array
   3.1. Navigate to href
        3.1.1. Grab all elements using CSS selector "div.listing-summary a" and store the
               .text of each (company names)
        3.1.2. If an element .by_link_text("Next") exists, click it and return to 3.1.1.

If you want the business contact info from the company pages, you will probably also want to store the hrefs in step 3.1.1, then loop through that list and grab what you want from each page; a rough sketch of this flow is below.

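A minimal sketch of that flow might look like this (untested against the live site; the CSS selectors "#index a" and "div.listing-summary a" come from the steps above, and the "Next" link text is an assumption about the site's pagination markup):

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

browser = webdriver.Firefox()
browser.get('http://aucklandtradesmen.co.nz/')

# Step 2: store every category href up front, so navigating away
# never invalidates the elements we still need
category_hrefs = [a.get_attribute('href')
                  for a in browser.find_elements_by_css_selector('#index a')]

company_names = []
for href in category_hrefs:
    browser.get(href)  # Step 3.1: navigate directly instead of clicking
    while True:
        # Step 3.1.1: collect the company names on the current page
        for a in browser.find_elements_by_css_selector('div.listing-summary a'):
            company_names.append(a.text)
        # Step 3.1.2: follow the "Next" pagination link if it exists
        try:
            browser.find_element_by_link_text('Next').click()
        except NoSuchElementException:
            break

print(company_names)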

Answer 1 (score: 0):

Well, after thinking about @jeffC's suggestion, I came up with a solution:

  • Extract the href values and append them to the base URL http://aucklandtradesmen.co.nz. For example, if the extracted href is /home-mainmenu-1/alarms-a-security/armed-alarms-ltd-.html, tell the browser to navigate to that URL. Then I can do whatever I want on the current page.
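
A minimal sketch of that approach, in the same Python 2 style as the question (the listing-link selector 'div.listing-summary a' is an assumption about where the listing links live; the base URL and example href come from the bullet above):

from selenium import webdriver

browser = webdriver.Firefox()
browser.get('http://aucklandtradesmen.co.nz/')  # or a category page

base_url = 'http://aucklandtradesmen.co.nz'

# collect the listing hrefs first, e.g.
# '/home-mainmenu-1/alarms-a-security/armed-alarms-ltd-.html'
hrefs = [a.get_attribute('href')
         for a in browser.find_elements_by_css_selector('div.listing-summary a')]

for href in hrefs:
    # get_attribute('href') usually returns an absolute URL already;
    # prepend the base URL only if the stored value is relative
    url = href if href.startswith('http') else base_url + href
    browser.get(url)  # navigate straight to the listing page
    # ... scrape the contact details from the current page here ...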