Scraping with Beautiful Soup and Selenium

Time: 2020-08-25 08:50:34

Tags: python selenium selenium-webdriver beautifulsoup

I'm learning how to use Beautiful Soup together with Selenium, and I found a website that has multiple tables, identified by table tags (my first time dealing with them). I'm trying to scrape the text from each table and append each element to its respective list. To start, I'm only trying to scrape the first table; I want to do the rest on my own. But for some reason I can't access the tag.

I've also brought in Selenium to reach the site, because for some reason the list of tables disappears when I copy the link into another tab.

My code so far:

import requests
from bs4 import BeautifulSoup
import pandas as pd
import re
from selenium import webdriver
from selenium.webdriver.support.ui import Select

PATH = "C:\Program Files (x86)\chromedriver.exe"
driver = webdriver.Chrome(PATH)

targetSite =  "https://www.sdvisualarts.net/sdvan_new/events.php"
driver.get(targetSite)

select_event = Select(driver.find_element_by_name('subs'))
select_event.select_by_value('All')

select_loc = Select(driver.find_element_by_name('loc'))
select_loc.select_by_value("All")

driver.find_element_by_name("submit").click()


targetSite   = "https://www.sdvisualarts.net/sdvan_new/viewevents.php"
event_title = []
name = []
address = []
city = []
state = []
zipCode = []
location = []
webSite = []
fee = []
event_dates = []
opening_dates = []
description = []

try:
    page = requests.get(targetSite )
    soup = BeautifulSoup(page.text, 'html.parser')
    items = soup.find_all('table', {"class":"popdetail"})
    for i in items:
        event_title.append(item.find('b', {'class': "text"})).text.strip()
        name.append(item.find('td', {'class': "text"})).text.strip()
        address.append(item.find('td', {'class': "text"})).text.strip()
        city.append(item.find('td', {'class': "text"})).text.strip()
        state.append(item.find('td', {'class': "text"})).text.strip()
        zipCode.append(item.find('td', {'class': "text"})).text.strip()

Could someone let me know whether I'm doing this correctly? This is my first time dealing with a site whose URL elements disappear when the link is copied into a new tab and/or window.

So far I haven't been able to add any information to any of the lists.

1 Answer:

Answer 0 (score: 2)

One issue is the for loop.

You have for i in items:, but inside the loop you are calling item instead of i.
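For reference, a minimal sketch of the corrected loop, keeping your original find() calls (note that .text.strip() also has to go inside append(...), not after it), and assuming each popdetail table really contains a b/td tag with class "text":

for item in items:
    event_title.append(item.find('b', {'class': 'text'}).text.strip())
    name.append(item.find('td', {'class': 'text'}).text.strip())
    # ...same pattern for the remaining lists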

Secondly, since you're rendering the page with Selenium, you should probably use Selenium to get the html as well. The site also has tables embedded inside tables, so it isn't as simple as iterating over the <table> tags. What I ended up doing was reading the tables in with pandas (which returns a list of dataframes) and then iterating through those, since there's a pattern to how the dataframes are constructed:

import pandas as pd
from selenium import webdriver
from selenium.webdriver.support.ui import Select

PATH = "C:\Program Files (x86)\chromedriver.exe"
driver = webdriver.Chrome(PATH)

targetSite =  "https://www.sdvisualarts.net/sdvan_new/events.php"
driver.get(targetSite)

select_event = Select(driver.find_element_by_name('subs'))
select_event.select_by_value('All')

select_loc = Select(driver.find_element_by_name('loc'))
select_loc.select_by_value("All")

driver.find_element_by_name("submit").click()


targetSite   = "https://www.sdvisualarts.net/sdvan_new/viewevents.php"
event_title = []
name = []
address = []
city = []
state = []
zipCode = []
location = []
webSite = []
fee = []
event_dates = []
opening_dates = []
description = []

# read every <table> on the rendered page into a list of DataFrames
dfs = pd.read_html(driver.page_source)
driver.close()

for idx, table in enumerate(dfs):
    # each event starts with a small table whose first cell is 'Event Title'
    if table.iloc[0,0] == 'Event Title':
        event_title.append(table.iloc[-1,0])

        # the tables that follow hold the venue details (Name/Address/...),
        # the fee/dates, and the event description, at fixed offsets
        tempA = dfs[idx+1]
        tempA.index = tempA[0]

        tempB = dfs[idx+4]
        tempB.index = tempB[0]

        tempC = dfs[idx+5]
        tempC.index = tempC[0]
        
        name.append(tempA.loc['Name',1])
        address.append(tempA.loc['Address',1])
        city.append(tempA.loc['City',1])
        state.append(tempA.loc['State',1])
        zipCode.append(tempA.loc['Zip',1])
        location.append(tempA.loc['Location',1])
        webSite.append(tempA.loc['Web Site',1])
        
        fee.append(tempB.loc['Fee',1])
        event_dates.append(tempB.loc['Dates',1])
        opening_dates.append(tempB.loc['Opening Days',1])
        
        description.append(tempC.loc['Event Description',1])
        
df = pd.DataFrame({'event_title':event_title,
                    'name':name,
                    'address':address,
                    'city':city,
                    'state':state,
                    'zipCode':zipCode,
                    'location':location,
                    'webSite':webSite,
                    'fee':fee,
                    'event_dates':event_dates,
                    'opening_dates':opening_dates,
                    'description':description})
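To sanity-check the result, you could, for example, print the first few rows or write the frame out (the filename below is just an illustration):

print(df.head())
df.to_csv('sdvan_events.csv', index=False)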