How to collect data and download files from a web link

Time: 2019-02-09 05:04:36

Tags: python python-3.x selenium selenium-webdriver

I have a link from which I want to collect announcement details and download the attachments using Python.

url ='https://www.nseindia.com/corporates/corporateHome.html'

Open the "Corporate Announcements - Equity" tab.

Link for Collecting data

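I want to collect the following:

  1. The announcements
  2. The URL links of the attachments
  3. The attachments themselves, downloaded to a local drive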

1 answer:

Answer 0 (score: 2)

There is no need to use Selenium, since requests.get() returns the data. Unfortunately, it comes back as text/html;charset=ISO-8859-1 rather than application/json.

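However, the data is sent in a JSON-like structure, so the string needs some manipulation (quoting the bare keys) before the json module can read it. You can then dump the result into a table to get the data.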
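As a minimal sketch of that string manipulation, the snippet below applies the key-quoting idea to a small made-up payload. The sample string, its values, and the helper name `quote_keys` are assumptions for illustration only; the real response may be shaped differently. It uses a regular expression anchored on the preceding `{` or `,`, which makes it less likely (though not impossible) that a key name appearing inside a value gets rewritten.

```python
import json
import re

# Hypothetical sample: the row keys are bare words, so this is not valid JSON yet
sample = ('{"rows":[{company:"ABC LIMITED",symbol:"ABC",desc:"Updates",'
          'link:"/corporates/sample.jsp?id=1",date:"08-Feb-2019 18:30"}]}')

KEYS = ['company', 'date', 'desc', 'link', 'symbol']


def quote_keys(text, keys):
    """Wrap each bare key in double quotes so json.loads can parse the text."""
    # Only match a key that directly follows '{' or ',' (plus optional spaces)
    pattern = r'([{,]\s*)(%s)\s*:' % '|'.join(map(re.escape, keys))
    return re.sub(pattern, r'\1"\2":', text)


rows = json.loads(quote_keys(sample, KEYS))['rows']
print(rows[0]['company'], rows[0]['link'])
```

A plain `str.replace` on `'date:'` would also rewrite any value that happens to contain that text; the anchored pattern narrows the match but is still a workaround, not a real parser.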

To get the PDFs, you then need to iterate through the links you obtained and write each file to disk:

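import requests
import json
import bs4
# json_normalize is a top-level pandas function since pandas 1.0;
# the older pandas.io.json.json_normalize import path is deprecated
from pandas import json_normalize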


base_url = 'https://www.nseindia.com'
url = 'https://www.nseindia.com/corporates/directLink/latestAnnouncementsCorpHome.jsp'

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}

response = requests.get(url, headers=headers)

jsonStr = response.text.strip()

keys_needing_quotes = ['company:','date:','desc:','link:','symbol:']

for key in keys_needing_quotes:
    jsonStr = jsonStr.replace(key, '"%s":' %(key[:-1]))

data = json.loads(jsonStr)
data = data['rows']

# puts the data into dataframe
df = json_normalize(data)
links = [ base_url + ele['link'] for ele in data ]


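path = 'C:/path/to/file/'  # change to the folder where attachments should be saved

for link in links:
    # each announcement page links to its attachment through an <a href=...> tag
    response = requests.get(link, headers=headers)
    soup = bs4.BeautifulSoup(response.text, 'html.parser')

    anchors = soup.find_all('a', href=True)
    if not anchors:
        # skip this announcement; otherwise pdf_file would be undefined or stale
        print('PDF not found')
        continue

    pdf_file = base_url + anchors[0]['href']
    filename = path + pdf_file.split('/')[-1]

    # download the attachment (pdf or zip) and write the raw bytes to disk
    response = requests.get(pdf_file, headers=headers)
    with open(filename, 'wb') as f:
        f.write(response.content)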
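The loop above builds the attachment URL by string concatenation and takes the file name from the last `/`-separated piece. A slightly more defensive sketch, using only the standard library, is shown here. The helper name `attachment_target`, the `downloads` folder, and the example hrefs are hypothetical; whether the site ever returns absolute hrefs or query strings is an assumption, not something the original answer states.

```python
import os
from urllib.parse import unquote, urljoin, urlparse

base_url = 'https://www.nseindia.com'


def attachment_target(href, folder):
    """Return (absolute_url, local_path) for an attachment href."""
    # urljoin leaves an already-absolute href untouched and resolves a relative one
    url = urljoin(base_url + '/', href)
    # Use only the path component so a query string never ends up in the file name
    name = os.path.basename(unquote(urlparse(url).path)) or 'attachment'
    return url, os.path.join(folder, name)


# Hypothetical href, for illustration only
url, local_path = attachment_target('/corporate/sample_08022019.zip', 'downloads')
print(url)
print(local_path)
```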

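Output:

Here is what the dataframe looks like. The PDF files are written to whatever location you choose. Note that some attachments are zip files containing PDFs. I didn't bother unzipping those, although you could add that as an extra step before writing (in pseudocode: if the file is a zip, unzip it to get the PDF, then write to disk; if the file is a PDF, just write it to disk).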
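The optional unzip step described above could look roughly like the following. This is a sketch under stated assumptions, not part of the original answer: the helper name `extract_pdfs` is made up, and the in-memory archive at the bottom is a stand-in for a downloaded attachment.

```python
import io
import os
import zipfile


def extract_pdfs(name, content):
    """Return {filename: bytes} for the PDFs found in a downloaded attachment.

    A zip archive is opened in memory and only its .pdf members are returned;
    any other content is passed through unchanged under its original name.
    """
    buffer = io.BytesIO(content)
    if zipfile.is_zipfile(buffer):
        with zipfile.ZipFile(buffer) as archive:
            return {os.path.basename(member): archive.read(member)
                    for member in archive.namelist()
                    if member.lower().endswith('.pdf')}
    return {name: content}


# Stand-in for a downloaded zip attachment (hypothetical contents)
demo = io.BytesIO()
with zipfile.ZipFile(demo, 'w') as archive:
    archive.writestr('report.pdf', b'%PDF-1.4 demo')
    archive.writestr('readme.txt', b'not a pdf')

files = extract_pdfs('sample.zip', demo.getvalue())
print(sorted(files))
```

Inside the download loop, each entry of the returned dictionary could then be written with `open(path + name, 'wb')` in place of the single write.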

print (df)
                                   company     ...          symbol
0                 RELIANCE CAPITAL LIMITED     ...      RELCAPITAL
1          RELIANCE INFRASTRUCTURE LIMITED     ...        RELINFRA
2                    GRAND FOUNDRY LIMITED     ...      GRANDFONRY
3                    VRL LOGISTICS LIMITED     ...          VRLLOG
4                    GRAND FOUNDRY LIMITED     ...      GRANDFONRY
5   EUROTEX INDUSTRIES AND EXPORTS LIMITED     ...      EUROTEXIND
6                     PSP PROJECTS LIMITED     ...      PSPPROJECT
7                    VRL LOGISTICS LIMITED     ...          VRLLOG
8             THE UGAR SUGAR WORKS LIMITED     ...       UGARSUGAR
9                     ZUARI GLOBAL LIMITED     ...       ZUARIGLOB
10                   VRL LOGISTICS LIMITED     ...          VRLLOG
11                  RUPA & COMPANY LIMITED     ...            RUPA
12                 ANIK INDUSTRIES LIMITED     ...        ANIKINDS
13                 ARROW GREENTECH LIMITED     ...      ARROWGREEN
14       CENTURY PLYBOARDS (INDIA) LIMITED     ...      CENTURYPLY
15                     TARA JEWELS LIMITED     ...      TARAJEWELS
16           INDO COUNT INDUSTRIES LIMITED     ...            ICIL
17         LUMAX AUTO TECHNOLOGIES LIMITED     ...       LUMAXTECH
18                BLISS GVS PHARMA LIMITED     ...        BLISSGVS
19  EUROTEX INDUSTRIES AND EXPORTS LIMITED     ...      EUROTEXIND

[20 rows x 5 columns]