How to scrape all pages with Selenium and upload the scraped data to Excel in the desired format

Asked: 2020-03-05 05:40:21

Tags: python python-3.x selenium selenium-webdriver

I want to scrape teacher jobs from https://www.indeed.co.in/?r=us and upload them to an Excel sheet with columns such as job title, college/school, and salary. I wrote the scraping code below, but I get all the text from the XPath I defined as one undifferentiated block.

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions


url = 'https://www.indeed.co.in/?r=us'
driver = webdriver.Chrome(r"mypython/bin/chromedriver_linux64/chromedriver")
driver.get(url)

# search for "teacher" and submit the form
driver.find_element_by_xpath('//*[@id="text-input-what"]').send_keys("teacher")
driver.find_element_by_xpath('//*[@id="whatWhereFormId"]/div[3]/button').click()


# this selects the whole results column as a single element,
# which is why .text comes back as one big block
items = driver.find_elements_by_xpath('//*[@id="resultsCol"]')
for item in items:
    print(item.text)

On top of that, I am only able to scrape one page; I want all the pages available after searching for "teacher". Please help me, thanks.

3 answers:

Answer 0 (score: 1)

I encourage you to check out Beautiful Soup (https://pypi.org/project/beautifulsoup4/). I use it to scrape tables:

# read_header and read_row are helper functions defined in the linked project
def read_table(table):
    """Read an IP Address table.
    Args:
      table: the Soup <table> element
    Returns:
      None if the table isn't an IP Address table, otherwise a list of
        the IP Address:port values.
    """
    header = None
    rows = []
    for tr in table.find_all('tr'):
        if header is None:
            header = read_header(tr)
            if not header or header[0] != 'IP Address':
                return None
        else:
            row = read_row(tr)
            if row:
                rows.append('{}:{}'.format(row[0], row[1]))
    return rows

That is just an excerpt from one of my Python projects: https://github.com/backslash/WebScrapers/blob/master/us-proxy-scraper/us-proxy.py. You can scrape tables easily with Beautiful Soup, and if you are worried about getting blocked, you just need to send the right headers. Another advantage of Beautiful Soup is that you don't have to wait for elements to load.

import requests

HEADERS = requests.utils.default_headers()
HEADERS.update({
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0',
})
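As a self-contained illustration of the table-scraping idea (using a made-up HTML snippet in the same shape as the IP-address table above, not Indeed's actual markup), Beautiful Soup can walk a table like this:

```python
from bs4 import BeautifulSoup

# hypothetical HTML standing in for a fetched page
html = """
<table>
  <tr><th>IP Address</th><th>Port</th></tr>
  <tr><td>10.0.0.1</td><td>8080</td></tr>
  <tr><td>10.0.0.2</td><td>3128</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = []
for tr in soup.find("table").find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:  # the header row has only <th> cells, so it is skipped
        rows.append("{}:{}".format(cells[0], cells[1]))

print(rows)  # ['10.0.0.1:8080', '10.0.0.2:3128']
```

The same pattern (find the table, iterate `tr`, pull cell text) carries over to any page whose data sits in an HTML table.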

Good luck!

Answer 1 (score: 0)

You have to navigate to each page and scrape it step by step, i.e. automate clicking the next-page button in Selenium (using the XPath of the "Next" button element), then extract the data from the page source. Hope this helps.
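The page-by-page loop described above can be sketched as a small helper. Both the default XPath and the stop condition (no pagination control left on the last page) are assumptions that may need adjusting to Indeed's actual markup:

```python
def scrape_all_pages(driver, parse_page, next_xpath="//span[@class='pn']"):
    """Parse the current page, then keep clicking the last pagination
    control until none remains, collecting one parsed result per page."""
    pages = []
    while True:
        pages.append(parse_page(driver.page_source))
        controls = driver.find_elements_by_xpath(next_xpath)
        if not controls:
            break  # assumed stop condition: no next-page control left
        driver.execute_script("arguments[0].click();", controls[-1])
    return pages
```

With a real driver this would be called as `scrape_all_pages(driver, my_parser)` after submitting the search, where `my_parser` is whatever function extracts the fields you need from the HTML.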

Answer 2 (score: 0)

Try this, and don't forget to import the Selenium modules:

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import NoSuchElementException

url = 'https://www.indeed.co.in/?r=us'

# driver set up as in the question, e.g. webdriver.Chrome(<path to chromedriver>)
driver.get(url)

driver.find_element_by_xpath('//*[@id="text-input-what"]').send_keys("teacher")
driver.find_element_by_xpath('//*[@id="whatWhereFormId"]/div[3]/button').click()

# loop over result pages: scrape each job card, then click the next-page button
while True:
    data = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "resultsCol")))
    result_set = WebDriverWait(data, 10).until(
        EC.presence_of_all_elements_located((By.CLASS_NAME, "jobsearch-SerpJobCard")))

    for result in result_set:
        title = result.find_element_by_class_name("title").text
        print(title)

        school = result.find_element_by_class_name("company").text
        print(school)

        try:
            salary = result.find_element_by_class_name("salary").text
            print(salary)
        except NoSuchElementException:
            # some cards have no salary
            pass
        print("--------")

    # move to the next page (the last pagination control)
    next_page = driver.find_elements_by_xpath("//span[@class='pn']")[-1]
    driver.execute_script("arguments[0].click();", next_page)
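None of the answers show the Excel step the question asks about. A minimal sketch, using hypothetical rows in place of the scraped (title, school, salary) values, writes a CSV file that Excel opens directly; the `csv` module is standard library, while a native .xlsx file would need a third-party library such as openpyxl instead:

```python
import csv

# hypothetical rows standing in for the scraped (title, school, salary) values
rows = [
    ("Math Teacher", "ABC School", "25,000 a month"),
    ("Science Teacher", "XYZ College", ""),  # no salary listed
]

with open("teacher_jobs.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Job Title", "College/School", "Salary"])
    writer.writerows(rows)
```

In the scraping loop above, you would append one tuple per job card instead of hard-coding `rows`, and write the file once after the last page.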