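I have a link from which I want to collect announcement details and download the attachments using Python.
url = 'https://www.nseindia.com/corporates/corporateHome.html'
Open the "Corporate Announcements - Equities" tab.
That is the data I want to collect.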
Answer 0 (score: 2)
Since requests.get() returns the data, there is no need to use Selenium. Unfortunately, the response does not come back as application/json but as text/html;charset=ISO-8859-1.
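The data is still sent in a JSON-like structure, though, so the string needs some manipulation (quoting the bare keys) before json can read it. You can then dump it into a table (a DataFrame) to get the data.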
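As a rough sketch of that key-quoting step, the snippet below uses a regular expression instead of a fixed list of key names. The sample string is invented for illustration and is not the site's actual payload, and quote_bare_keys is a hypothetical helper name. It assumes bare keys always directly follow "{" or ",", so a string value that itself contains text like ", note:" would be mangled.

```python
import json
import re

# Matches a bare (unquoted) key that directly follows "{" or ",".
BARE_KEY = re.compile(r'([{,]\s*)([A-Za-z_]\w*)\s*:')


def quote_bare_keys(text):
    """Wrap bare object keys in double quotes so json.loads can parse the text."""
    return BARE_KEY.sub(r'\1"\2":', text)


# Invented sample, shaped like the rows the answer describes.
sample = (
    '{"rows":[{company:"ABC LIMITED",date:"01-Jan-2019 10:30",'
    'desc:"Updates",link:"/example/abc.pdf",symbol:"ABC"}]}'
)

rows = json.loads(quote_bare_keys(sample))["rows"]
print(rows[0]["company"], rows[0]["date"])
```

The colon inside the time value is left alone because it does not follow "{" or ",". Keys that are already quoted are not matched either, so running the helper twice is harmless.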
To get the PDFs, iterate over the links you obtained and write each file to disk:
import json

import bs4
import pandas as pd
import requests

base_url = 'https://www.nseindia.com'
url = 'https://www.nseindia.com/corporates/directLink/latestAnnouncementsCorpHome.jsp'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}
path = 'C:/path/to/file/'  # folder the attachments are saved to

response = requests.get(url, headers=headers)
jsonStr = response.text.strip()

# the payload uses bare keys, so quote them to make it valid JSON
keys_needing_quotes = ['company:', 'date:', 'desc:', 'link:', 'symbol:']
for key in keys_needing_quotes:
    jsonStr = jsonStr.replace(key, '"%s":' % key[:-1])

data = json.loads(jsonStr)
data = data['rows']

# puts the data into dataframe (pd.json_normalize needs pandas >= 1.0)
df = pd.json_normalize(data)

links = [base_url + ele['link'] for ele in data]
for link in links:
    # each announcement page links to its attachment
    response = requests.get(link, headers=headers)
    soup = bs4.BeautifulSoup(response.text, 'html.parser')
    try:
        pdf_file = base_url + soup.find_all('a', href=True)[0]['href']
    except IndexError:
        print('PDF not found')
        continue  # nothing to download for this announcement

    filename = path + pdf_file.split('/')[-1]
    response = requests.get(pdf_file, headers=headers)
    with open(filename, 'wb') as f:
        f.write(response.content)
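Output:
Here is what the dataframe looks like. The PDF files are written to whatever location you choose. Note that some of the downloads are zip files containing PDFs. I did not bother unzipping those, although you could add that as an extra step before writing (in pseudocode: if the file is a zip, unzip it to get the PDF, then write to disk; if the file is a PDF, just write it to disk).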
print(df)
    company                                 ...  symbol
 0  RELIANCE CAPITAL LIMITED                ...  RELCAPITAL
 1  RELIANCE INFRASTRUCTURE LIMITED         ...  RELINFRA
 2  GRAND FOUNDRY LIMITED                   ...  GRANDFONRY
 3  VRL LOGISTICS LIMITED                   ...  VRLLOG
 4  GRAND FOUNDRY LIMITED                   ...  GRANDFONRY
 5  EUROTEX INDUSTRIES AND EXPORTS LIMITED  ...  EUROTEXIND
 6  PSP PROJECTS LIMITED                    ...  PSPPROJECT
 7  VRL LOGISTICS LIMITED                   ...  VRLLOG
 8  THE UGAR SUGAR WORKS LIMITED            ...  UGARSUGAR
 9  ZUARI GLOBAL LIMITED                    ...  ZUARIGLOB
10  VRL LOGISTICS LIMITED                   ...  VRLLOG
11  RUPA & COMPANY LIMITED                  ...  RUPA
12  ANIK INDUSTRIES LIMITED                 ...  ANIKINDS
13  ARROW GREENTECH LIMITED                 ...  ARROWGREEN
14  CENTURY PLYBOARDS (INDIA) LIMITED       ...  CENTURYPLY
15  TARA JEWELS LIMITED                     ...  TARAJEWELS
16  INDO COUNT INDUSTRIES LIMITED           ...  ICIL
17  LUMAX AUTO TECHNOLOGIES LIMITED         ...  LUMAXTECH
18  BLISS GVS PHARMA LIMITED                ...  BLISSGVS
19  EUROTEX INDUSTRIES AND EXPORTS LIMITED  ...  EUROTEXIND
[20 rows x 5 columns]
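The optional unzip step mentioned above could look roughly like this. It is only a sketch using the standard-library zipfile and io modules: extract_pdfs is a hypothetical helper name, and it assumes the download is either a zip archive or a single PDF. The demo builds a small archive in memory, so nothing is fetched from the site.

```python
import io
import zipfile


def extract_pdfs(content, filename):
    """Return {name: bytes} for the PDF(s) held in a downloaded payload.

    If the payload is a zip archive, every member ending in .pdf is returned
    under its base name; otherwise the payload is returned under `filename`.
    """
    if zipfile.is_zipfile(io.BytesIO(content)):
        with zipfile.ZipFile(io.BytesIO(content)) as archive:
            return {
                name.split('/')[-1]: archive.read(name)
                for name in archive.namelist()
                if name.lower().endswith('.pdf')
            }
    return {filename: content}


# Demo with an in-memory archive (invented contents, not real attachments).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr('docs/report.pdf', b'%PDF-1.4 demo')
    archive.writestr('readme.txt', b'not a pdf')

zipped = extract_pdfs(buffer.getvalue(), 'bundle.zip')
plain = extract_pdfs(b'%PDF-1.4 plain', 'plain.pdf')
print(sorted(zipped), sorted(plain))
```

In the download loop, you would call this on response.content and write each returned entry to disk instead of writing the raw response.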