I'm trying to scrape all of the player names from the Alabama football roster found here: https://rolltide.com/roster.aspx?roster=226&path=football
I'm able to get the first player's name, but the script stops after him and never gets any of the other players.
Here is my code:
DesiredRoster = (URLEntry.get())
driver = webdriver.Firefox()
driver.get(DesiredRoster)
#Player Name
Name = driver.find_element_by_class_name('sidearm-roster-player-name')
PlayerName = Name.find_element_by_tag_name('a').text
print(PlayerName)
How do I iterate through the page to get all of the names? I hit the same problem trying to grab the jersey numbers:
numbers = driver.find_elements_by_class_name('sidearm-roster-player-jersey-number')
print(numbers.text)
AttributeError: 'list' object has no attribute 'text'
Strangely enough, if I change elements to element, it will print the first player's number.
Answer 0 (score: 2)
You are using the driver method find_element_by_class_name, which returns only a single element. Switch to find_elements_by_class_name to get a list back, then iterate over that list:
names = driver.find_elements_by_class_name('sidearm-roster-player-name')
for name in names:
    player_name = name.find_element_by_tag_name('a').text
    print(player_name)
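If you also want the jersey numbers you were trying for, one option is to collect both lists and pair them with zip. A minimal sketch using the same legacy find_elements_by_* API as above (Selenium 4 renamed these to find_elements(By.CLASS_NAME, ...)); it assumes every roster entry renders both a name and a number, so the two lists stay aligned:
# Grab names and jersey numbers in document order, then pair them up.
names = driver.find_elements_by_class_name('sidearm-roster-player-name')
numbers = driver.find_elements_by_class_name('sidearm-roster-player-jersey-number')
for name, number in zip(names, numbers):
    # Each item is a WebElement; .text returns its visible text.
    print(number.text, name.find_element_by_tag_name('a').text)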
Answer 1 (score: 2)
For me, at least, a User-Agent header was required, and then I could use requests. You can then gather the parent nodes with a CSS class selector, loop over those parent nodes, and extract the desired info into a dataframe, again using faster, shorter CSS selectors. As mentioned before, the key in this case is to gather all the parent nodes by using select. This is far less overhead than Selenium.
Py:
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd
r = requests.get('https://rolltide.com/roster.aspx?roster=226&path=football', headers = {'User-Agent':'Mozilla/5.0'})
soup = bs(r.content, 'lxml')
results = {}
for num, p in enumerate(soup.select('.sidearm-roster-player')):
    results[num] = {'position': p.select_one('.sidearm-roster-player-position > span:first-child').text.strip()
                    , 'height': p.select_one('.sidearm-roster-player-height').text
                    , 'weight': p.select_one('.sidearm-roster-player-weight').text
                    , 'number': p.select_one('.sidearm-roster-player-jersey-number').text
                    , 'name': p.select_one('.sidearm-roster-player-name a').text
                    , 'year': p.select_one('.sidearm-roster-player-academic-year').text
                    , 'hometown': p.select_one('.sidearm-roster-player-hometown').text
                    , 'highschool': p.select_one('.sidearm-roster-player-highschool').text
                    }
df = pd.DataFrame(results.values(), columns = ['position','height','weight','number','name','year','hometown','highschool'])
print(df)
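If you want to keep what you scraped, pandas can write the frame straight to disk; a minimal usage sketch (the filename here is just an example):
# Persist the roster; 'roster.csv' is an arbitrary example name.
df.to_csv('roster.csv', index=False)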
R:
purrr is used to handle the loop over the parent nodes and write out to a df. str_squish from stringr is used to tidy the output of one of the child nodes within the loop. httr is used to supply the headers.
library(httr)
library(purrr)
library(rvest)
library(stringr)
headers = c('User-Agent' = 'Mozilla/5.0')
pg <- content(httr::GET(url = 'https://rolltide.com/roster.aspx?roster=226&path=football', httr::add_headers(.headers=headers)))
df <- map_df(pg %>% html_nodes('.sidearm-roster-player'), function(item) {
  data.frame(position = str_squish(item %>% html_node('.sidearm-roster-player-position > span:first-child') %>% html_text()),
             height = item %>% html_node('.sidearm-roster-player-height') %>% html_text(),
             weight = item %>% html_node('.sidearm-roster-player-weight') %>% html_text(),
             number = item %>% html_node('.sidearm-roster-player-jersey-number') %>% html_text(),
             name = item %>% html_node('.sidearm-roster-player-name a') %>% html_text(),
             year = item %>% html_node('.sidearm-roster-player-academic-year') %>% html_text(),
             hometown = item %>% html_node('.sidearm-roster-player-hometown') %>% html_text(),
             highschool = item %>% html_node('.sidearm-roster-player-highschool') %>% html_text(),
             stringsAsFactors = FALSE)
})
View(df)
Answer 2 (score: 0)
For anyone looking to use R (rvest), the code below gathers the roster data into a dataframe:
library(tidyverse)
library(magrittr)
library(rvest)
url <- "https://rolltide.com/roster.aspx?roster=226&path=football"
page <- url %>% read_html()
position <- list()
height <- list()
weight <- list()
number <- list()
name <- list()
yr <- list()
hometown <- list()
high.school <- list()
for (i in seq(1, 250)) {
  position[[i]] <- page %>% html_nodes(xpath = paste0('//*[@id="main-content"]/article/div[4]/div/div[1]/div[2]/div[1]/section/ul/li[', i, ']/div[1]/div[1]/div[2]/div[1]/span[1]/text()')) %>% xml_text %>% str_trim
  height[[i]] <- page %>% html_nodes(xpath = paste0('//*[@id="main-content"]/article/div[4]/div/div[1]/div[2]/div[1]/section/ul/li[', i, ']/div[1]/div[1]/div[2]/div[1]/span[2]')) %>% xml_text
  weight[[i]] <- page %>% html_nodes(xpath = paste0('//*[@id="main-content"]/article/div[4]/div/div[1]/div[2]/div[1]/section/ul/li[', i, ']/div[1]/div[1]/div[2]/div[1]/span[3]/text()')) %>% xml_text
  number[[i]] <- page %>% html_nodes(xpath = paste0('//*[@id="main-content"]/article/div[4]/div/div[1]/div[2]/div[1]/section/ul/li[', i, ']/div[1]/div[1]/div[2]/div[2]/span/span')) %>% xml_text
  name[[i]] <- page %>% html_nodes(xpath = paste0('//*[@id="main-content"]/article/div[4]/div/div[1]/div[2]/div[1]/section/ul/li[', i, ']/div[1]/div[1]/div[2]/div[2]/p/a')) %>% xml_text
  yr[[i]] <- page %>% html_nodes(xpath = paste0('//*[@id="main-content"]/article/div[4]/div/div[1]/div[2]/div[1]/section/ul/li[', i, ']/div[1]/div[2]/div[1]/span[1]')) %>% xml_text
  hometown[[i]] <- page %>% html_nodes(xpath = paste0('//*[@id="main-content"]/article/div[4]/div/div[1]/div[2]/div[1]/section/ul/li[', i, ']/div[1]/div[2]/div[1]/span[2]/text()')) %>% xml_text
  high.school[[i]] <- page %>% html_nodes(xpath = paste0('//*[@id="main-content"]/article/div[4]/div/div[1]/div[2]/div[1]/section/ul/li[', i, ']/div[1]/div[2]/div[1]/span[3]/text()')) %>% xml_text
}
position %<>% tibble %>% unnest
height %<>% tibble %>% unnest
weight %<>% tibble %>% unnest
number %<>% tibble %>% unnest
name %<>% tibble %>% unnest
yr %<>% tibble %>% unnest
hometown %<>% tibble %>% unnest
high.school %<>% tibble %>% unnest
final <- bind_cols(position,height,weight,number,name,yr,hometown,high.school)
names(final) <- c("position","height","weight","number","name","yr","hometown","high.school")
The trick was to select by XPath rather than CSS selectors, and to use xpath= in the html_nodes() calls.
This is obviously a bit ugly, but it doesn't require Selenium or any other cumbersome setup.
Edit: you should check out QHarr's answer above for much more streamlined code.