Trying to loop through a webpage to scrape all the football player names, but only getting the first one?

Asked: 2019-10-18 17:01:16

Tags: python selenium

I am trying to scrape all the player names on the Alabama football roster, found here: https://rolltide.com/roster.aspx?roster=226&path=football

I am able to get the first player's name, but it stops after him and doesn't get any of the other players' names.

Here is my code:


from selenium import webdriver  # needed for webdriver.Firefox()

# URLEntry is defined elsewhere in my script (a Tkinter Entry widget)
DesiredRoster = URLEntry.get()

driver = webdriver.Firefox()
driver.get(DesiredRoster)

# Player name
Name = driver.find_element_by_class_name('sidearm-roster-player-name')
PlayerName = Name.find_element_by_tag_name('a').text
print(PlayerName)

How do I loop through the webpage to get all of the names? I tried this:


numbers = driver.find_elements_by_class_name('sidearm-roster-player-jersey-number')
print(numbers.text)

AttributeError: 'list' object has no attribute 'text'

Strangely, if I change elements back to element, it prints the first player's number.
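The error follows from the return types: `find_element_*` returns a single WebElement (which has a `.text` attribute), while `find_elements_*` returns a plain Python list, which has no `.text` of its own, so each element in the list must be read individually. A minimal sketch of the difference (`StubElement` is a hypothetical stand-in for Selenium's WebElement so the example runs without a browser):

```python
# find_elements_* returns a plain Python list, so .text must be read per element.
# StubElement is a stand-in for Selenium's WebElement; it only mimics the .text attribute.
class StubElement:
    def __init__(self, text):
        self.text = text

def element_texts(elements):
    """Collect .text from each element, as you would after find_elements_by_class_name."""
    return [el.text for el in elements]

numbers = [StubElement("4"), StubElement("9"), StubElement("13")]
print(element_texts(numbers))  # -> ['4', '9', '13']
```

Calling `.text` on the list itself raises exactly the AttributeError above; looping (or a list comprehension as in `element_texts`) is the fix.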

3 Answers:

Answer 0 (score: 2):

You are using the driver method find_element_by_class_name, which returns only a single element. Switch to find_elements_by_class_name to get a list, then loop over that list:

names = driver.find_elements_by_class_name('sidearm-roster-player-name')
for name in names:
    player_name = name.find_element_by_tag_name('a').text
    print(player_name)

Answer 1 (score: 2):

In my case, at least a User-Agent header was required, and then I could use requests. You can then gather the parent nodes with a CSS class selector, loop over those parents, and extract the desired info into a dataframe. Again, this uses faster and shorter CSS selectors. As mentioned, the key in this case is to gather all the parent nodes by using select. This is less overhead than Selenium.


Py:

from bs4 import BeautifulSoup as bs
import requests
import pandas as pd

r = requests.get('https://rolltide.com/roster.aspx?roster=226&path=football', headers = {'User-Agent':'Mozilla/5.0'})
soup = bs(r.content, 'lxml')
results = {}

for num, p in enumerate(soup.select('.sidearm-roster-player')):
    results[num] = {'position': p.select_one('.sidearm-roster-player-position >span:first-child').text.strip()
           ,'height': p.select_one('.sidearm-roster-player-height').text
           ,'weight': p.select_one('.sidearm-roster-player-weight').text
           ,'number': p.select_one('.sidearm-roster-player-jersey-number').text
           ,'name': p.select_one('.sidearm-roster-player-name a').text
           ,'year': p.select_one('.sidearm-roster-player-academic-year').text
           ,'hometown': p.select_one('.sidearm-roster-player-hometown').text
           ,'highschool': p.select_one('.sidearm-roster-player-highschool').text
          }
df = pd.DataFrame(results.values(), columns = ['position','height','weight','number','name','year','hometown','highschool'])
print(df)
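One caveat worth noting about the snippet above: select_one returns None when a roster card happens to lack one of these classes, which would raise AttributeError on .text. A small defensive helper (node_text is my own addition, not part of the original answer) falls back to an empty string instead; the inline markup below is a made-up miniature of one roster card, used only so the sketch runs standalone:

```python
from bs4 import BeautifulSoup

def node_text(parent, selector, default=''):
    """Return the stripped text of the first match for selector, or default if absent."""
    node = parent.select_one(selector)
    return node.get_text(strip=True) if node else default

# Hypothetical markup mirroring the shape of one roster card:
card = BeautifulSoup(
    '<div class="sidearm-roster-player">'
    '<div class="sidearm-roster-player-name"><a>Joe Example</a></div>'
    '</div>', 'html.parser')

print(node_text(card, '.sidearm-roster-player-name a'))  # -> Joe Example
print(node_text(card, '.sidearm-roster-player-height'))  # -> '' (node missing)
```

Swapping each `p.select_one(...).text` call for `node_text(p, ...)` keeps the scraper running even if a card is missing a field.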

R:

purrr is used to handle the loop over the parent nodes and write to a df. str_squish from stringr tidies the output of one of the child nodes inside the loop. httr is used to supply the header.

library(httr)
library(purrr)
library(rvest)
library(stringr)

headers = c('User-Agent' = 'Mozilla/5.0')
pg <- content(httr::GET(url = 'https://rolltide.com/roster.aspx?roster=226&path=football', httr::add_headers(.headers=headers)))

df <- map_df(pg%>%html_nodes('.sidearm-roster-player'), function(item) {

     data.frame(position = str_squish(item%>%html_node('.sidearm-roster-player-position >span:first-child')%>%html_text()),
                height = item%>%html_node('.sidearm-roster-player-height')%>%html_text(),
                weight = item%>%html_node('.sidearm-roster-player-weight')%>%html_text(),
                number = item%>%html_node('.sidearm-roster-player-jersey-number')%>%html_text(),
                name = item%>%html_node('.sidearm-roster-player-name a')%>%html_text(),
                year = item%>%html_node('.sidearm-roster-player-academic-year')%>%html_text(),
                hometown = item%>%html_node('.sidearm-roster-player-hometown')%>%html_text(),
                highschool = item%>%html_node('.sidearm-roster-player-highschool')%>%html_text(),
                stringsAsFactors=FALSE)
     })

View(df)

Answer 2 (score: 0):

For anyone wanting to use R (rvest), the code below gathers the roster data into a dataframe:

library(tidyverse)
library(magrittr)
library(rvest)

url <- "https://rolltide.com/roster.aspx?roster=226&path=football"
page <- url %>% read_html()

position <- list()
height <- list()
weight <- list()
number <- list()
name <- list()
yr <- list()
hometown <- list()
high.school <- list()

for (i in seq(1,250)) {
    position[[i]] <- page %>% html_nodes(xpath=paste0('//*[@id="main-content"]/article/div[4]/div/div[1]/div[2]/div[1]/section/ul/li[',i,']/div[1]/div[1]/div[2]/div[1]/span[1]/text()')) %>% xml_text %>% str_trim
    height[[i]] <- page %>% html_nodes(xpath=paste0('//*[@id="main-content"]/article/div[4]/div/div[1]/div[2]/div[1]/section/ul/li[',i,']/div[1]/div[1]/div[2]/div[1]/span[2]')) %>% xml_text
    weight[[i]] <- page %>% html_nodes(xpath=paste0('//*[@id="main-content"]/article/div[4]/div/div[1]/div[2]/div[1]/section/ul/li[',i,']/div[1]/div[1]/div[2]/div[1]/span[3]/text()')) %>% xml_text
    number[[i]] <- page %>% html_nodes(xpath=paste0('//*[@id="main-content"]/article/div[4]/div/div[1]/div[2]/div[1]/section/ul/li[',i,']/div[1]/div[1]/div[2]/div[2]/span/span')) %>% xml_text
    name[[i]] <- page %>% html_nodes(xpath=paste0('//*[@id="main-content"]/article/div[4]/div/div[1]/div[2]/div[1]/section/ul/li[',i,']/div[1]/div[1]/div[2]/div[2]/p/a')) %>% xml_text
    yr[[i]] <- page %>% html_nodes(xpath=paste0('//*[@id="main-content"]/article/div[4]/div/div[1]/div[2]/div[1]/section/ul/li[',i,']/div[1]/div[2]/div[1]/span[1]')) %>% xml_text
    hometown[[i]] <- page %>% html_nodes(xpath=paste0('//*[@id="main-content"]/article/div[4]/div/div[1]/div[2]/div[1]/section/ul/li[',i,']/div[1]/div[2]/div[1]/span[2]/text()')) %>% xml_text
    high.school[[i]] <- page %>% html_nodes(xpath=paste0('//*[@id="main-content"]/article/div[4]/div/div[1]/div[2]/div[1]/section/ul/li[',i,']/div[1]/div[2]/div[1]/span[3]/text()')) %>% xml_text
}

position    %<>% tibble %>% unnest
height      %<>% tibble %>% unnest
weight      %<>% tibble %>% unnest
number      %<>% tibble %>% unnest
name        %<>% tibble %>% unnest
yr          %<>% tibble %>% unnest
hometown    %<>% tibble %>% unnest
high.school %<>% tibble %>% unnest

final <- bind_cols(position,height,weight,number,name,yr,hometown,high.school)
names(final) <- c("position","height","weight","number","name","yr","hometown","high.school")

The trick is selecting by XPath rather than CSS selectors, using xpath= inside the html_nodes() calls.

This is obviously a bit ugly, but it doesn't require Selenium or any other cumbersome setup.

Edit: you should check out QHarr's answer above for much more streamlined code.