Extracting HTML links from a page on the following website

Time: 2019-02-21 09:59:43

Tags: python web-scraping beautifulsoup

I want to extract the link

/stocks/company_info/stock_news.php?sc_id=CHC&scat=&pageno=2&next=0&durationType=Y&Year=2018&duration=1&news_type=

from the HTML of the page

http://www.moneycontrol.com/company-article/piramalenterprises/news/PH05#PH05

Here is the code I used:

import requests
from bs4 import BeautifulSoup

url_list = "http://www.moneycontrol.com/company-article/piramalenterprises/news/PH05#PH05"
html = requests.get(url_list)
soup = BeautifulSoup(html.text, 'html.parser')
link = soup.find_all('a')
print(link)

I am using Beautiful Soup. How can I do this? Calling find_all('a') does not return the required link from the HTML that comes back.

2 Answers:

Answer 0: (score: 1)

You just need to use the get method to read the href attribute of each link.
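As a minimal sketch of that idea, the snippet below reads href from every anchor with the tag's get method. The inline markup and the helper name extract_hrefs are assumptions made for illustration; with the real page, the markup would be the text of the requests response.

```python
from bs4 import BeautifulSoup

# Stand-in markup for illustration only; it is not copied from the real page.
html = """
<a href="/stocks/company_info/stock_news.php?pageno=2">2</a>
<a name="top">anchor without href</a>
<a href="/news/">News</a>
"""

def extract_hrefs(markup):
    """Return the href value of every <a> tag that has one (hypothetical helper)."""
    soup = BeautifulSoup(markup, "html.parser")
    hrefs = []
    for a in soup.find_all("a"):
        # get() returns None when the attribute is missing, whereas a["href"] raises KeyError
        href = a.get("href")
        if href is not None:
            hrefs.append(href)
    return hrefs

links = extract_hrefs(html)
print(links)
```

Using get rather than indexing keeps the loop from failing on anchors that have no href attribute.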

Answer 1: (score: 1)

Try this to get the exact URLs you need.

import bs4 as bs
import requests
import re


sauce = requests.get('https://www.moneycontrol.com/stocks/company_info/stock_news.php?sc_id=CHC&durationType=Y&Year=2018')

soup = bs.BeautifulSoup(sauce.text, 'html.parser')

# Keep only anchors whose href contains "company_info", then those that carry a page number
for a in soup.find_all('a', href=re.compile("company_info")):
    # print(a['href'])
    if 'pageno' in a['href']:
        print(a['href'])

Output:

/stocks/company_info/stock_news.php?sc_id=CHC&scat=&pageno=2&next=0&durationType=Y&Year=2018&duration=1&news_type=
/stocks/company_info/stock_news.php?sc_id=CHC&scat=&pageno=3&next=0&durationType=Y&Year=2018&duration=1&news_type=
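The hrefs printed above are relative paths. If absolute URLs are needed for a follow-up request, one possible approach is to resolve them against the URL that was fetched, using the standard library. This is only a sketch: the helper names to_absolute and page_number are assumptions, and the base URL is the one requested in the answer's code.

```python
from urllib.parse import urljoin, urlparse, parse_qs

# Base taken from the URL requested in the answer above.
BASE_URL = "https://www.moneycontrol.com/stocks/company_info/stock_news.php"

def to_absolute(href, base=BASE_URL):
    """Resolve a relative href against the page it was found on (hypothetical helper)."""
    return urljoin(base, href)

def page_number(href):
    """Return the pageno query value as an int, or None if it is absent (hypothetical helper)."""
    values = parse_qs(urlparse(href).query).get("pageno")
    return int(values[0]) if values else None

relative = ("/stocks/company_info/stock_news.php?sc_id=CHC&scat=&pageno=2"
            "&next=0&durationType=Y&Year=2018&duration=1&news_type=")
absolute = to_absolute(relative)
print(absolute)
print(page_number(absolute))
```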