Facing problems with a Twitter scraper written in Python and Selenium

Date: 2017-06-10 15:09:02

Tags: python-3.x selenium twitter web-scraping web-crawler

I've written a script in Python to parse the name, tweets, following and followers of every profile listed under the "view all" section of my Twitter profile page. It is currently doing its job. However, I've found two problems with this scraper:

  1. Every page it parses piles up as another window on the taskbar.
  2. The scraper looks very clumsy.

Here is what I have written:

    from selenium import webdriver
    import time
    
    def twitter_data():
    
        driver = webdriver.Chrome()
        driver.get('https://twitter.com/?lang=en')
    
        driver.find_element_by_xpath('//input[@id="signin-email"]').send_keys('username')
        driver.find_element_by_xpath('//input[@id="signin-password"]').send_keys('password')
        driver.find_element_by_xpath('//button[@type="submit"]').click()
        driver.implicitly_wait(15)
    
        #Clicking the viewall link
        driver.find_element_by_xpath("//small[@class='view-all']//a[contains(@class,'js-view-all-link')]").click()
        time.sleep(10)
    
        for links in driver.find_elements_by_xpath("//div[@class='stream-item-header']//a[contains(@class,'js-user-profile-link')]"):
            processing_files(links.get_attribute("href"))
            # going to each profile listed under the viewall section
    def processing_files(item_link):
    
        driver = webdriver.Chrome()
        driver.get(item_link)
        # getting information of each profile holder
        for prof in driver.find_elements_by_xpath("//div[@class='route-profile']"):
            name = prof.find_elements_by_xpath(".//h1[@class='ProfileHeaderCard-name']//a[contains(@class,'ProfileHeaderCard-nameLink')]")[0]
            tweet = prof.find_elements_by_xpath(".//span[@class='ProfileNav-value']")[0]
            following = prof.find_elements_by_xpath(".//span[@class='ProfileNav-value']")[1]
            follower = prof.find_elements_by_xpath(".//span[@class='ProfileNav-value']")[2]
            print(name.text, tweet.text, following.text, follower.text)
    
    twitter_data()
    

I have used implicitly_wait along with time.sleep in my scraper; I fall back on the latter whenever I find it necessary to make the bot wait for a while. Thanks in advance for taking a look.
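For what it's worth, most fixed time.sleep pauses like the one above can be replaced with Selenium's explicit waits, which block only until a condition holds. A minimal sketch using WebDriverWait and expected_conditions (the XPath is reused from the script above, and the login steps are omitted; treat it as illustrative, not verified against the live page):

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    driver = webdriver.Chrome()
    driver.get('https://twitter.com/?lang=en')
    # ... log in as in the script above ...

    # Wait up to 15 seconds for the "view all" link to become clickable,
    # then click it; no fixed sleep needed.
    wait = WebDriverWait(driver, 15)
    view_all = wait.until(EC.element_to_be_clickable(
        (By.XPATH, "//small[@class='view-all']//a[contains(@class,'js-view-all-link')]")))
    view_all.click()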

1 answer:

Answer 0 (score: 1)

You can close each page with driver.quit(), as shown below. That will cut down the number of windows on the taskbar.

from selenium import webdriver
import time

def twitter_data():

    driver = webdriver.Chrome()
    driver.get('https://twitter.com/?lang=en')

    driver.find_element_by_xpath('//input[@id="signin-email"]').send_keys('username')
    driver.find_element_by_xpath('//input[@id="signin-password"]').send_keys('password')
    driver.find_element_by_xpath('//button[@type="submit"]').click()
    driver.implicitly_wait(15)

    #Clicking the viewall link
    driver.find_element_by_xpath("//small[@class='view-all']//a[contains(@class,'js-view-all-link')]").click()
    time.sleep(10)

    for links in driver.find_elements_by_xpath("//div[@class='stream-item-header']//a[contains(@class,'js-user-profile-link')]"):
        processing_files(links.get_attribute("href"))
        # going to each profile listed under the viewall section

    driver.quit()

def processing_files(item_link):

    driver1 = webdriver.Chrome()
    driver1.get(item_link)
    # getting information of each profile holder
    for prof in driver1.find_elements_by_xpath("//div[@class='route-profile']"):
        name = prof.find_elements_by_xpath(".//h1[@class='ProfileHeaderCard-name']//a[contains(@class,'ProfileHeaderCard-nameLink')]")[0]
        tweet = prof.find_elements_by_xpath(".//span[@class='ProfileNav-value']")[0]
        following = prof.find_elements_by_xpath(".//span[@class='ProfileNav-value']")[1]
        follower = prof.find_elements_by_xpath(".//span[@class='ProfileNav-value']")[2]
        print(name.text, tweet.text, following.text, follower.text)

    # quit outside the loop, after every profile element has been read
    driver1.quit()

twitter_data()
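Going one step further than the answer above (this is a suggestion, not part of the accepted code): the taskbar only fills up because processing_files launches a fresh Chrome instance per profile, so collecting the hrefs first and revisiting them with the one logged-in driver avoids the extra windows entirely. A rough sketch, assuming driver is the logged-in instance from twitter_data:

# Grab the hrefs up front; navigating the same driver afterwards would
# otherwise leave stale WebElement references behind.
links = [a.get_attribute("href") for a in driver.find_elements_by_xpath(
    "//div[@class='stream-item-header']//a[contains(@class,'js-user-profile-link')]")]

for link in links:
    driver.get(link)  # reuse the same window: nothing new on the taskbar
    for prof in driver.find_elements_by_xpath("//div[@class='route-profile']"):
        name = prof.find_element_by_xpath(
            ".//a[contains(@class,'ProfileHeaderCard-nameLink')]")
        values = prof.find_elements_by_xpath(".//span[@class='ProfileNav-value']")
        print(name.text, values[0].text, values[1].text, values[2].text)

driver.quit()  # a single browser to close at the end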