I want to scrape teacher job postings from https://www.indeed.co.in/?r=us and upload them to an Excel sheet with columns such as job title, college/school, and salary. I wrote the scraping code below, but I get all of the text under the XPath I defined.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions

url = 'https://www.indeed.co.in/?r=us'
driver = webdriver.Chrome(r"mypython/bin/chromedriver_linux64/chromedriver")
driver.get(url)

# search for "teacher"
driver.find_element_by_xpath('//*[@id="text-input-what"]').send_keys("teacher")
driver.find_element_by_xpath('//*[@id="whatWhereFormId"]/div[3]/button').click()

# this XPath matches the entire results column, so .text returns everything in it
items = driver.find_elements_by_xpath('//*[@id="resultsCol"]')
for item in items:
    print(item.text)
Also, I am only able to scrape one page; I want all of the pages available after searching for teacher. Please help me, thank you.
Answer 0 (score: 1)
I encourage you to check out Beautiful Soup (https://pypi.org/project/beautifulsoup4/). I have used it to scrape tables.
def read_table(table):
    """Read an IP Address table.

    Args:
        table: the Soup <table> element

    Returns:
        None if the table isn't an IP Address table, otherwise a list of
        the IP Address:port values.
    """
    header = None
    rows = []
    for tr in table.find_all('tr'):
        if header is None:
            header = read_header(tr)
            if not header or header[0] != 'IP Address':
                return None
        else:
            row = read_row(tr)
            if row:
                rows.append('{}:{}'.format(row[0], row[1]))
    return rows
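For context, a minimal usage sketch, assuming html holds an already-fetched page (read_header and read_row are helper functions from the project linked below, not shown here):

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')  # html: page source fetched earlier
for table in soup.find_all('table'):
    addresses = read_table(table)
    if addresses:  # None means the table was not an IP Address table
        print(addresses)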
This is just an excerpt from one of my Python projects: https://github.com/backslash/WebScrapers/blob/master/us-proxy-scraper/us-proxy.py. You can scrape tables easily with Beautiful Soup, and if you are worried about getting blocked, you just need to send the right headers. Another advantage of Beautiful Soup is that you don't have to wait for elements to load.
import requests

HEADERS = requests.utils.default_headers()
HEADERS.update({
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0',
})
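Putting the pieces together, a minimal sketch of fetching a search results page with those headers and parsing it with Beautiful Soup; the search URL pattern and the jobsearch-SerpJobCard / title / company class names are assumptions borrowed from the Selenium answer below, so verify them against the live markup:

import requests
from bs4 import BeautifulSoup

resp = requests.get('https://www.indeed.co.in/jobs?q=teacher', headers=HEADERS)
soup = BeautifulSoup(resp.text, 'html.parser')

for card in soup.find_all(class_='jobsearch-SerpJobCard'):
    # class names assumed from the Selenium answer below; Indeed's markup changes
    title = card.find(class_='title')
    company = card.find(class_='company')
    print(title.get_text(strip=True) if title else '',
          company.get_text(strip=True) if company else '')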
Good luck!
Answer 1 (score: 0)
You have to navigate to each page and scrape them one at a time, i.e. automate clicking the next-page button in Selenium (using the XPath of the "Next" button element), then extract the data from the page source. Hope this helps.
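A minimal sketch of that idea, assuming driver is already initialized and that the "Next" link is the last element with class pn (the selector used in the answer below):

while True:
    html = driver.page_source  # parse the job data out of this HTML here
    pagination = driver.find_elements_by_xpath("//span[@class='pn']")
    if not pagination:
        break  # no pagination controls left, so stop
    # click the last 'pn' element, assumed to be the Next link
    driver.execute_script("arguments[0].click();", pagination[-1])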
Answer 2 (score: 0)
Try this, and don't forget to import the Selenium modules.
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

url = 'https://www.indeed.co.in/?r=us'
driver.get(url)  # driver initialized as in the question
driver.find_element_by_xpath('//*[@id="text-input-what"]').send_keys("teacher")
driver.find_element_by_xpath('//*[@id="whatWhereFormId"]/div[3]/button').click()

while True:
    # scrape data: wait for the results column, then collect the job cards
    data = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "resultsCol")))
    result_set = WebDriverWait(data, 10).until(
        EC.presence_of_all_elements_located((By.CLASS_NAME, "jobsearch-SerpJobCard")))
    for result in result_set:
        title = result.find_element_by_class_name("title").text
        print(title)
        school = result.find_element_by_class_name("company").text
        print(school)
        try:
            salary = result.find_element_by_class_name("salary").text
            print(salary)
        except NoSuchElementException:
            pass  # some job cards have no salary
        print("--------")
    # move to next page: the last element with class 'pn' is the Next link
    pagination = driver.find_elements_by_xpath("//span[@class='pn']")
    if not pagination:
        break  # no pagination controls: last page reached
    driver.execute_script("arguments[0].click();", pagination[-1])
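To get the results into the Excel sheet the question asks for, one option is a minimal sketch using Python's built-in csv module (Excel opens .csv files directly); here rows is a hypothetical list of (title, school, salary) tuples collected in the loop above instead of the print() calls:

import csv

with open('teacher_jobs.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Job Title', 'College/School', 'Salary'])
    writer.writerows(rows)  # rows: list of (title, school, salary) tuples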