I want to extract the link
/stocks/company_info/stock_news.php?sc_id=CHC&scat=&pageno=2&next=0&durationType=Y&Year=2018&duration=1&news_type=
from the HTML of the page
http://www.moneycontrol.com/company-article/piramalenterprises/news/PH05#PH05
Here is the code I am using:
import requests
from bs4 import BeautifulSoup

url_list = "http://www.moneycontrol.com/company-article/piramalenterprises/news/PH05#PH05"
html = requests.get(url_list)
soup = BeautifulSoup(html.text, 'html.parser')
# Collect every <a> tag on the page
link = soup.find_all('a')
print(link)
I am using Beautiful Soup. How can I do this? Calling find_all('a') does not return the link I need from the HTML that comes back.
Answer 0: (score: 1)
You just need to use the get method on each tag to read its href attribute.
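The answer's own snippet did not survive in this copy, so the following is only a rough sketch of what reading href with get could look like. SAMPLE_HTML and extract_hrefs are hypothetical names introduced for illustration; the sample markup stands in for the downloaded page and is not taken from the real site.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the downloaded page.
SAMPLE_HTML = """
<a href="/stocks/company_info/stock_news.php?pageno=2">2</a>
<a name="PH05">anchor without an href</a>
<a href="/news/">News</a>
"""


def extract_hrefs(html):
    """Return the href value of every <a> tag that has one."""
    soup = BeautifulSoup(html, "html.parser")
    hrefs = []
    for a in soup.find_all("a"):
        href = a.get("href")  # get returns None when the attribute is missing
        if href is not None:
            hrefs.append(href)
    return hrefs


print(extract_hrefs(SAMPLE_HTML))
```

Unlike a['href'], get does not raise KeyError for anchors that lack the attribute, which is why it is a convenient choice when looping over every <a> tag.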
Answer 1: (score: 1)
Try this to get the exact URLs you need.
import re

import bs4 as bs
import requests

# Request the stock_news.php listing page
sauce = requests.get('https://www.moneycontrol.com/stocks/company_info/stock_news.php?sc_id=CHC&durationType=Y&Year=2018')
soup = bs.BeautifulSoup(sauce.text, 'html.parser')

# Keep anchors whose href contains "company_info", then print only the pagination links
for a in soup.find_all('a', href=re.compile("company_info")):
    if 'pageno' in a['href']:
        print(a['href'])
Output:
/stocks/company_info/stock_news.php?sc_id=CHC&scat=&pageno=2&next=0&durationType=Y&Year=2018&duration=1&news_type=
/stocks/company_info/stock_news.php?sc_id=CHC&scat=&pageno=3&next=0&durationType=Y&Year=2018&duration=1&news_type=
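The hrefs printed above are relative to the site root. If absolute URLs are needed, one option is urljoin from the standard library. This is a minimal sketch, assuming the base URL is the page requested in the answer; BASE_URL, relative_links and to_absolute are hypothetical names used only for illustration.

```python
from urllib.parse import urljoin

# Assumption: the base is the listing page requested in the answer above.
BASE_URL = "https://www.moneycontrol.com/stocks/company_info/stock_news.php?sc_id=CHC&durationType=Y&Year=2018"

# The two relative hrefs shown in the output above.
relative_links = [
    "/stocks/company_info/stock_news.php?sc_id=CHC&scat=&pageno=2&next=0&durationType=Y&Year=2018&duration=1&news_type=",
    "/stocks/company_info/stock_news.php?sc_id=CHC&scat=&pageno=3&next=0&durationType=Y&Year=2018&duration=1&news_type=",
]


def to_absolute(base, hrefs):
    """Resolve each relative href against the base URL."""
    return [urljoin(base, href) for href in hrefs]


absolute_links = to_absolute(BASE_URL, relative_links)
for url in absolute_links:
    print(url)
```

Because each href starts with a slash, urljoin keeps only the scheme and host from the base and takes the path and query string from the href.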