Iterating over an element in BeautifulSoup, but only outputting that element's children

Posted: 2019-04-30 16:57:04

Tags: python selenium selenium-webdriver web-scraping beautifulsoup

My current problem is figuring out how to use BeautifulSoup and Selenium to web-scrape an esports site called Rocket League eSports.

I was able to locate the data, and since the page relies on scripts I used Selenium; I then used BeautifulSoup to pull the data out. From there I can extract all the team names, but as I append to the list I keep getting 'None' in the list.

from selenium import webdriver
from bs4 import BeautifulSoup
#import soupsieve
import time

#create a Google Chrome session
browser = webdriver.Chrome(executable_path='/home/jdr1018/chromedriver')

#maximize the Google Chrome window
browser.maximize_window()

#fetch the URL
browser.get('https://www.rocketleagueesports.com/stats/')

#pause to allow the page to load
time.sleep(4)

#search the container and find all elements with the h5 tag to print the given elements
#container = browser.find_elements_by_tag_name('h5')

#hand the Selenium page source over to BeautifulSoup
soup_source = BeautifulSoup(browser.page_source, 'lxml')

namelist = [] #empty list for team names

winpercentlist = [] #empty list for win percentages

rocketleaguedict = {} #empty dict for namelist + winpercentlist

#use XPath to find the h5 elements with class "name" and assign their link text to teamnames
elements = browser.find_elements_by_xpath('//h5[@class="name"]/a')
teamnames = [element.text for element in elements]
#loop through the team names to get each individual team name
for name in teamnames:
    #skip the name if it is already in the list
    if name in namelist:
        pass
    else:
        #append each new team name to the list
        namelist.append(name)
    #print namelist after each iteration to verify
    print(namelist)
#for i in container:
#    print(i.get_attribute("innerHTML"))

#once the program is done, close Google Chrome
browser.close()

My output looks something like the following:

['CHIEFS ESPORTS CLUB']
['CHIEFS ESPORTS CLUB', 'NRG ESPORTS']
['CHIEFS ESPORTS CLUB', 'NRG ESPORTS', 'ICON ESPORTS']
['CHIEFS ESPORTS CLUB', 'NRG ESPORTS', 'ICON ESPORTS', 'RENAULT SPORT TEAM VITALITY']
['CHIEFS ESPORTS CLUB', 'NRG ESPORTS', 'ICON ESPORTS', 'RENAULT SPORT TEAM VITALITY', 'ERODIUM']
['CHIEFS ESPORTS CLUB', 'NRG ESPORTS', 'ICON ESPORTS', 'RENAULT SPORT TEAM VITALITY', 'ERODIUM', 'LOWKEY ESPORTS'] ...

That's not exactly the output, but the point is that it is littered with these 'None' values and I don't know exactly why.

2 answers:

Answer 0 (score: 1):

Use this:

elements = browser.find_elements_by_xpath('//h5[@class="name"]/a')
teamnames = [element.text for element in elements]
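For reference, here is a minimal sketch (not part of the original answer) of how these two lines could be dropped into the question's script to build namelist without duplicates; it assumes browser is the Selenium driver from the question, with the stats page already loaded:

# Sketch only: `browser` is the Selenium driver from the question,
# with the stats page already loaded.
elements = browser.find_elements_by_xpath('//h5[@class="name"]/a')
teamnames = [element.text for element in elements]

# Deduplicate while preserving order (dict keys keep insertion order in Python 3.7+).
namelist = list(dict.fromkeys(teamnames))
print(namelist)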

An explanation of why your approach didn't work:

Your solution produces a lot of None values because the cells under the columns 'G', 'G/GM', etc. have the same HTML tag name and class as the team-name cells.

So teamnames ends up as a list of elements containing those numbers, and there is no <a href>...</a> HTML inside them. When no such element exists, calling name.find('a') (link to BeautifulSoup documentation on find()) returns None, which is why you get runs of 6 Nones.
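If you would rather keep using BeautifulSoup (the question already builds soup_source), here is a minimal sketch that sidesteps the None issue by selecting only the anchors nested inside h5 elements with the "name" class; the selector is inferred from the class names discussed above and is not verified against the live page:

# Sketch only: `soup_source` is the BeautifulSoup object from the question.
# Selecting the <a> tags nested in <h5 class="name"> skips the stat cells,
# which have no anchor and would otherwise contribute None.
team_links = soup_source.select('h5.name a')
namelist = [link.get_text(strip=True) for link in team_links]
print(namelist)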

Answer 1 (score: 0):

You can get the team names with a regular expression and requests. The regex could probably be made more efficient (I'd appreciate any suggestions on that).

import requests
import re

res = requests.get('https://www.rocketleagueesports.com/ajax/standings-script/?league=7-57d5ab4-qm0qcw&season=7-cab6afe099-06tjgk&region=0&stage=7-57d5ab4-g1dsq3')
r = re.compile(r'name: "((?:(?!").)*)')
teams = r.findall(res.text)

Sample output:


The regex:

See the regex and an explanation here.

It basically targets strings in the script tag of the form name: "TeamName". The negative lookahead makes sure each match stops at the " right after a team name, so you get each team name individually rather than one long match that only ends at the " after the last team name.
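As a quick illustration, here is a minimal sketch of that pattern run against a made-up fragment (the sample string is invented for the example, not taken from the real page):

import re

# Hypothetical fragment mimicking the structure of the script tag.
sample = 'teams: [{ name: "NRG ESPORTS", wins: 10 }, { name: "ICON ESPORTS", wins: 7 }]'

# Capture everything after `name: "` up to (but not including) the next quote.
r = re.compile(r'name: "((?:(?!").)*)')
print(r.findall(sample))  # ['NRG ESPORTS', 'ICON ESPORTS']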

Other references:

  1. https://www.regular-expressions.info/tutorial.html
  2. https://www.regular-expressions.info/lookaround.html