BeautifulSoup unable to extract href links

Asked: 2017-06-08 11:25:38

Tags: python selenium web-scraping beautifulsoup phantomjs

So I am using Selenium with PhantomJS as my webdriver, together with BeautifulSoup. Currently I want to extract all the links under the Title column. The site I want to extract from is the Bursa Malaysia company-announcements page used in the code below.

However, it doesn't seem to pick up these links at all! What is going on?

# The standard library modules
import os
import sys
import re

# The wget module
import wget

# The BeautifulSoup module
from bs4 import BeautifulSoup

# The selenium module
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By


def getListLinks(link):
    #setup drivers
    driver = webdriver.PhantomJS(service_args=['--ignore-ssl-errors=true'])
    driver.get(link) # load the web page
    src = driver.page_source 

    #Get text and split it
    soup = BeautifulSoup(src, 'html5lib')
    print(soup)
    links = soup.find_all('a')
    print(links)

    driver.close()

getListLinks("http://www.bursamalaysia.com/market/listed-companies/company-announcements/#/?category=FA&sub_category=FA1&alphabetical=All&company=9695&date_from=01/01/2012&date_to=31/12/2016")

Here is an example of the links I want to extract:

<a href="/market/listed-companies/company-announcements/5455245">Quarterly rpt on consolidated results for the financial period ended 31/03/2017</a>

2 Answers:

Answer 0 (score: 2):

What I really don't understand is why you are mixing BeautifulSoup with Selenium. Selenium has its own API for extracting DOM elements, so you don't need to bring BS4 into the picture. Besides, BS4 can only work with static HTML and ignores dynamically generated HTML, which your Selenium instance is already able to handle.

Just do:

driver.find_elements_by_tag_name('a')
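
For example, a minimal Selenium-only sketch (assuming the same PhantomJS setup as in the question; find_elements_by_tag_name, plural, returns every match rather than just the first):

# A minimal Selenium-only sketch, assuming the question's PhantomJS setup.
from selenium import webdriver

driver = webdriver.PhantomJS(service_args=['--ignore-ssl-errors=true'])
driver.get("http://www.bursamalaysia.com/market/listed-companies/company-announcements/#/?category=FA&sub_category=FA1&alphabetical=All&company=9695&date_from=01/01/2012&date_to=31/12/2016")

# find_elements_by_tag_name (plural) returns all matching elements,
# including those rendered by JavaScript after the initial load.
for a in driver.find_elements_by_tag_name('a'):
    print(a.get_attribute('href'))

driver.quit()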

Answer 1 (score: 0):

You want the links under the Title column (the fourth column of the table). You can use an :nth-of-type selector to restrict the target cells (the td elements) to the fourth column of each row of the target table. A wait condition is added to ensure the elements are present.

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

d = webdriver.Chrome()
url = 'http://www.bursamalaysia.com/market/listed-companies/company-announcements/#/?category=all'
d.get(url)

# Wait up to 10s for the 4th-column anchors to be present, then collect their hrefs
links = [link.get_attribute('href') for link in
         WebDriverWait(d, 10).until(EC.presence_of_all_elements_located(
             (By.CSS_SELECTOR, 'tr td:nth-of-type(4) a')))]
print(links)
d.quit()
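
If you still want to use BeautifulSoup, the same CSS selector can be applied to the rendered page source once the rows exist. A sketch under that assumption (same URL and wait as above; bs4's select() understands :nth-of-type):

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup

d = webdriver.Chrome()
d.get('http://www.bursamalaysia.com/market/listed-companies/company-announcements/#/?category=all')

# Wait until the 4th-column anchors are rendered before reading page_source,
# so BeautifulSoup sees the dynamically generated rows.
WebDriverWait(d, 10).until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, 'tr td:nth-of-type(4) a')))

soup = BeautifulSoup(d.page_source, 'html5lib')
links = [a['href'] for a in soup.select('tr td:nth-of-type(4) a')]
print(links)
d.quit()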