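I am trying to scrape every .pdf link on this webpage, along with each PDF's title and the time it was received. To find the href links on the page, I tried the following code: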
from bs4 import BeautifulSoup
import requests
source = requests.get('https://www.bseindia.com/corporates/ann.html?scrip=532538').text
soup = BeautifulSoup(source, 'lxml')
for link in soup.find_all('a'):
    if link.has_attr('href'):
        print(link.attrs['href'])
I get the following output:
{{CorpannData.Table[0].NSURL}}
{{CorpannData.Table[0].NSURL}}
#
/xml-data/corpfiling/AttachLive/{{cann.ATTACHMENTNAME}}
/xml-data/corpfiling/AttachHis/{{cann.ATTACHMENTNAME}}
/xml-data/corpfiling/AttachLive/{{CorpannDataByNewsId[0].ATTACHMENTNAME}}
/xml-data/corpfiling/AttachHis/{{CorpannDataByNewsId[0].ATTACHMENTNAME}}
The output I want is the full list of PDF links, like this:
https://www.bseindia.com/xml-data/corpfiling/AttachHis/e525dbbb-5ec1-4327-a5ea-9662c66f32a5.pdf
https://www.bseindia.com/xml-data/corpfiling/AttachHis/d2355247-3287-4c41-be61-2a5655276e79.pdf
(Optional) I would like the output of the whole program to be:
Title: Compliances-Reg. 39 (3) - Details of Loss of Certificate / Duplicate Certificate
Exchange received time: 19-12-2019 13:49:14
PDF link: https://www.bseindia.com/xml-data/corpfiling/AttachHis/e525dbbb-5ec1-4327-a5ea-9662c66f32a5.pdf
...
and I would like the program to check the webpage for new updates every second.
Answer 0 (score: 1)
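The {{...}} placeholders in your output are templates that the page fills in with JavaScript after it loads, so the real links are not in the HTML that requests receives. You can request the JSON data directly instead: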
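import requests

# This answer reads the announcements from the JSON endpoint rather than the HTML page.
r = requests.get(
    'https://api.bseindia.com/BseIndiaAPI/api/AnnGetData/w?strCat=-1&strPrevDate=&strScrip=532538&strSearch=A&strToDate=&strType=C').json()

data = []
for item in r['Table']:
    # Normalise the timestamp: "2019-12-19T13:49:14" -> "2019-12-19 13:49:14".
    if item['News_submission_dt'] is None:
        item['News_submission_dt'] = "N/A"
    else:
        item['News_submission_dt'] = item['News_submission_dt'].replace(
            "T", " ")
    # Build the full PDF URL; a missing or empty attachment name becomes "N/A".
    if not item['ATTACHMENTNAME']:
        item['ATTACHMENTNAME'] = "N/A"
    else:
        item['ATTACHMENTNAME'] = f"https://www.bseindia.com/xml-data/corpfiling/AttachHis/{item['ATTACHMENTNAME']}"
    # Reduce the row to a (title, received time, PDF link) tuple.
    item = item['NEWSSUB'], item[
        'News_submission_dt'], item['ATTACHMENTNAME']
    print(
        f"Title: {item[0]}\nExchange received time: {item[1]}\nPDF: {item[2]}")
    data.append(item)  # keep each row, e.g. for writing to a file later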
Output (truncated):
Title: Compliances-Reg. 39 (3) - Details of Loss of Certificate / Duplicate Certificate
Exchange received time: 2019-12-19 13:49:14
PDF: https://www.bseindia.com/xml-data/corpfiling/AttachHis/e525dbbb-5ec1-4327-a5ea-9662c66f32a5.pdf
Title: Compliances-Reg. 39 (3) - Details of Loss of Certificate / Duplicate Certificate
Exchange received time: 2019-12-16 15:48:22
PDF: https://www.bseindia.com/xml-data/corpfiling/AttachHis/d2355247-3287-4c41-be61-2a5655276e79.pdf
Title: Announcement under Regulation 30 (LODR)-Analyst / Investor Meet - Intimation
Exchange received time: 2019-12-16 09:50:00
PDF: https://www.bseindia.com/xml-data/corpfiling/AttachHis/6d7ba756-a541-4c85-b711-7270db7cb003.pdf
Title: Allotment Of Non-Convertible Debentures
Exchange received time: 2019-12-11 16:44:33
PDF: https://www.bseindia.com/xml-data/corpfiling/AttachHis/cdb18e51-725f-43ac-b01f-89f322ae2f5b.pdf
Title: Lntimation Regarding Change Of Name Of Karvy Fintech Private Limited, Registrar & Transfer Agents
Exchange received time: 2019-12-09 15:48:49
PDF: https://www.bseindia.com/xml-data/corpfiling/AttachHis/9dd527d7-d39d-422d-8de8-c428c24e169e.pdf
Title: Compliances-Reg. 39 (3) - Details of Loss of Certificate / Duplicate Certificate
Exchange received time: 2019-12-05 14:44:23
PDF: https://www.bseindia.com/xml-data/corpfiling/AttachHis/38af1a6e-a597-47e7-85b8-b620a961df84.pdf
Title: Compliances-Reg. 39 (3) - Details of Loss of Certificate / Duplicate Certificate
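
The question also asks for the page to be checked for new updates every second, which the code above does not cover. One way to approach it, as a rough sketch only: fetch the rows repeatedly and report only those not seen before. The names `unseen`, `poll`, `on_new` and `rounds` are hypothetical helpers introduced here, and the `fetch` argument is assumed to be any callable that returns the list found under `'Table'` in the JSON response.

```python
import time


def unseen(rows, seen):
    """Return the rows whose identifying fields are not yet in `seen`.

    `seen` is a set that is updated in place, so a later call with the
    same rows returns an empty list.
    """
    fresh = []
    for row in rows:
        # Assumption: title + timestamp + attachment name identify a row.
        key = (row.get('NEWSSUB'), row.get('News_submission_dt'),
               row.get('ATTACHMENTNAME'))
        if key not in seen:
            seen.add(key)
            fresh.append(row)
    return fresh


def poll(fetch, interval=1.0, rounds=None, on_new=print):
    """Call `fetch` every `interval` seconds and pass new rows to `on_new`.

    `rounds=None` loops forever; a number stops after that many fetches.
    """
    seen = set()
    done = 0
    while rounds is None or done < rounds:
        for row in unseen(fetch(), seen):
            on_new(row)
        done += 1
        time.sleep(interval)
```

With requests, `fetch` could be something like `lambda: requests.get(url, timeout=10).json()['Table']`, where `url` is the endpoint used in the answer. Whether the site tolerates one request per second is not stated anywhere in this thread, so a longer interval may be the safer choice.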
The same rows can also be written to a CSV file.
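
For the CSV output mentioned above, a minimal sketch using the standard-library csv module might look like the following. It assumes each row is a `(title, received time, PDF link)` tuple like the `item` tuple built in the loop; the function name `write_rows` and the column headers are illustrative choices, not something given in the original answer.

```python
import csv


def write_rows(rows, path):
    """Write (title, received time, PDF link) tuples to a CSV file."""
    # newline='' lets the csv module control line endings itself.
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['Title', 'Exchange received time', 'PDF link'])
        writer.writerows(rows)
```

For example, `write_rows([('Allotment Of Non-Convertible Debentures', '2019-12-11 16:44:33', 'https://www.bseindia.com/xml-data/corpfiling/AttachHis/cdb18e51-725f-43ac-b01f-89f322ae2f5b.pdf')], 'announcements.csv')` would produce a header line followed by one data line. The csv module quotes fields that contain commas, so titles with punctuation should round-trip intact.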