Unable to scrape three fields from a table with a complex layout

Time: 2019-08-09 09:00:58

Tags: python python-3.x selenium selenium-webdriver web-scraping

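I've created a script in Python with Selenium to parse three fields (franking credit, gross dividend and further information) from a table available on a website. The last two fields are shown only after the browser clicks the round yellow buttons that have a plus sign in them.

Once the buttons are clicked they turn red, indicating that the information is now displayed.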

  

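My script can click all the buttons, but it cannot scrape the three fields from that table.

I've attached an image to show you what it really looks like.

I know that if I send an HTTP POST request with the appropriate payload to https://www.sharedividends.com.au/wp-content/custom/ajaxfile.php?code=MLT, I can get all the table fields as JSON, but that is not how I want to solve this.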

Website link

I've tried:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url = "https://www.sharedividends.com.au/mlt-dividend-history/"

driver = webdriver.Chrome()

driver.get(url)

table = driver.find_element_by_css_selector("#divTable")
driver.execute_script("arguments[0].scrollIntoView();",table)

for items in driver.find_elements_by_css_selector("td.sorting_1"):
    driver.execute_script("arguments[0].scrollIntoView();",items)
    items.click()

for elems in driver.find_elements_by_css_selector("#divTable tbody tr"):
    franking_credit = elems.find_elements_by_css_selector("td")[5].text
    gross_divident = elems.find_elements_by_css_selector("td")[6].text
    further_info = elems.find_elements_by_css_selector("td")[7].text
    print(franking_credit,gross_divident,further_info)

driver.quit()

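When I run the script above, it throws IndexError: list index out of range, pointing at the line that begins with franking_credit =

This is what the table looks like. In the image below I have marked the three fields of that table that I'm interested in.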

Image link

How can I parse the three fields from that table?

3 answers:

Answer 0: (score: 1)

This should do the trick!

from selenium import webdriver

driver = webdriver.Chrome('chromedriver/chromedriver.exe')

driver.get("https://www.sharedividends.com.au/mlt-dividend-history/")

for button in driver.find_elements_by_class_name("sorting_1"):
    button.click()

# Returns first part of the info
for item in driver.find_elements_by_xpath("//tr[@role='row']/td"):
    print(item.text)

# Returns second part of info
for a in driver.find_elements_by_xpath("//ul[@class='dtr-details']/li"):
    print(a.text)

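Output: this

Answer 1: (score: 1)

You get that error because, when the automation script runs, the table shows 20 rows with additional attributes instead of 10. Try the code below.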

data

Output on console:

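
To illustrate the row-count explanation in this answer: once the buttons are clicked, the table body holds extra child rows that have fewer cells than the data rows, so indexing td number 5 on them raises the IndexError. The following is only a minimal sketch, not the answerer's original code. pick_fields and scrape are hypothetical names; the selectors are taken from the question; it uses the Selenium 4 find_elements(By...) calls; and it assumes the hidden cells are still present in the DOM, which is why it reads textContent rather than .text.

```python
def pick_fields(cells, indexes=(5, 6, 7)):
    """Return the cells at `indexes`, or None when the row is too short."""
    if len(cells) <= max(indexes):
        return None  # e.g. an expanded child row with a single wide cell
    return tuple(cells[i] for i in indexes)


def scrape(url="https://www.sharedividends.com.au/mlt-dividend-history/"):
    # Imported lazily so pick_fields() can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Expand every row, as in the question (this adds the child rows).
        for button in driver.find_elements(By.CSS_SELECTOR, "td.sorting_1"):
            driver.execute_script("arguments[0].scrollIntoView();", button)
            button.click()
        for row in driver.find_elements(By.CSS_SELECTOR, "#divTable tbody tr"):
            # Assumption: hidden cells remain in the DOM, so textContent is
            # read because .text only returns text that is visible.
            cells = [td.get_attribute("textContent").strip()
                     for td in row.find_elements(By.CSS_SELECTOR, "td")]
            fields = pick_fields(cells)
            if fields is not None:  # skip the short child rows
                print(*fields)
    finally:
        driver.quit()
```

Calling scrape() needs a local Chrome and chromedriver. The guard in pick_fields is the only part that addresses the IndexError; whether columns 5 to 7 hold the three wanted fields depends on the page markup and should be checked against the live table.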

Answer 2: (score: 0)

To extract the data from the three fields Franking Credit, Gross Dividend and Further Information, you have to induce WebDriverWait for visibility_of_all_elements_located(), and you can use the following Locator Strategies:

  • Code block:

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    
    chrome_options = webdriver.ChromeOptions() 
    chrome_options.add_argument("start-maximized")
    chrome_options.add_argument('disable-infobars')
    driver = webdriver.Chrome(options=chrome_options, executable_path=r'C:\Utility\BrowserDrivers\chromedriver.exe')
    driver.get("https://www.sharedividends.com.au/mlt-dividend-history/")
    driver.execute_script("arguments[0].scrollIntoView();", WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "table#divTable"))))
    for elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//table[@aria-describedby='divTable_info']//tbody//tr/td[@class='sorting_1']"))):
        elem.click()
    all_fc = [my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.XPATH, "//table[@aria-describedby='divTable_info']//tbody//tr//td[position()=6]")))]
    all_gd = [my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.XPATH, "//table[@aria-describedby='divTable_info']//tbody//tr//td[position()=7]")))]
    all_fi = [my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.XPATH, "//table[@aria-describedby='divTable_info']//tbody//tr[@class='child']//li//span[@class='dtr-data']")))]
    for x,y,z in zip(all_fc, all_gd, all_fi):
        print(x,y,z)
    
  • Console output:

    $ 0.0446 $ 0.1486 10.4C FRANKED @ 30%; DRP NIL DISCOUNT
    
    $ 0.0107 $ 0.0357 2.5C FRANKED@30%; SP ECIAL; DRP SUSP
    
    $ 0.0386 $ 0.1286 9C FRANKED @ 30%; DR P NIL DISCOUNT
    
    $ 0.0437 $ 0.1457 10.2C FRANKED @ 30%; DRP NIL DISCOUNT
    
    $ 0.0377 $ 0.1257 8.8C FRANKED @ 30%; DRP NIL DISCOUNT
    
    $ 0.0429 $ 0.1429 10C FRANKED @ 30%; D RP NIL DISCOUNT
    
    $ 0.0373 $ 0.1243 8.7C FRANKED @ 30%; DRP NIL DISCOUNT
    
    $ 0.0424 $ 0.1414 9.9C FRANKED @ 30%; DRP NIL DISCOUNT
    
    $ 0.0373 $ 0.1243 8.7C FRANKED @ 30%; DRP
    
    $ 0.0441 $ 0.1471 10.3C FR@30%;0.4C SP ECIAL;DRP;NIL DIS
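
One caveat about the zip() call in this answer: the three lists come from three separate lookups, and zip() silently stops at the shortest list, so a missing or extra match would misalign or drop rows without any error. A minimal sketch of a length check follows; zip_rows is a hypothetical helper, and the sample values are copied from the first two lines of the console output above.

```python
def zip_rows(*columns):
    """Zip equal-length columns into row tuples; fail loudly on a mismatch."""
    lengths = [len(column) for column in columns]
    if len(set(lengths)) > 1:
        raise ValueError(f"column lengths differ: {lengths}")
    return list(zip(*columns))


# Sample values copied from the console output above.
all_fc = ["$ 0.0446", "$ 0.0107"]
all_gd = ["$ 0.1486", "$ 0.0357"]
all_fi = [
    "10.4C FRANKED @ 30%; DRP NIL DISCOUNT",
    "2.5C FRANKED@30%; SP ECIAL; DRP SUSP",
]

for x, y, z in zip_rows(all_fc, all_gd, all_fi):
    print(x, y, z)
```

On Python 3.10 and later, zip(all_fc, all_gd, all_fi, strict=True) performs the same check without a helper.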