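API and web scraping

Date: 2019-08-19 16:43:48

Tags: python api web-scraping beautifulsoup automation

I am trying to access the contents of the text files listed on a page. Each text file has a different URL, so I cannot generate the URLs in Python and read the contents with pandas. I am therefore trying to use the API instead. When I request a user token, this is what I get: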

{
  "jwt": "eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJzdWIiOjU5MDR9.b9elxkmNj0kmWxDPjal0_mLY9UPg7enoT7Cdg7gN1d0"
}

Now I am not sure how to use it to access all the text files on the first page mentioned above. Can someone guide me on how to proceed?

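1 answer:

Answer 0 (score: 0)

This script walks from page 1 to the last page and collects every link that ends in .txt, scraping the HTML directly rather than using the API token: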

import requests
from bs4 import BeautifulSoup
from pprint import pprint

# Site root, used to turn the relative "next" link into a full URL.
base_url = 'https://usda.library.cornell.edu'

# First page of the release listing.
url = 'https://usda.library.cornell.edu/concern/publications/c821gj76b?locale=en&page=1#release-items'

soup = BeautifulSoup(requests.get(url).text, 'html.parser')

page = 1
while True:
    print('Page no.{}...'.format(page))
    print('-' * 80)

    # Every <a> inside #release-items whose href ends in ".txt".
    txt_urls = [a["href"] for a in soup.select('#release-items a[href$=".txt"]')]
    pprint(txt_urls)

    # Follow the pagination link; stop when it is missing or is just "#".
    m = soup.select_one('a[rel="next"][href]')
    if m and m['href'] != '#':
        soup = BeautifulSoup(requests.get(base_url + m['href']).text, 'html.parser')
        page += 1
    else:
        break

Prints:

Page no.1...
--------------------------------------------------------------------------------
['https://downloads.usda.library.cornell.edu/usda-esmis/files/c821gj76b/kd17d5288/ms35tm800/agpr0719.txt',
 'https://downloads.usda.library.cornell.edu/usda-esmis/files/c821gj76b/r494vw17c/q524jz702/agpr0619.txt',
 'https://downloads.usda.library.cornell.edu/usda-esmis/files/c821gj76b/bc386t90p/vx021r07n/agpr0519.txt',
 'https://downloads.usda.library.cornell.edu/usda-esmis/files/c821gj76b/3484zr667/4j03d7561/agpr0419.txt',
 'https://downloads.usda.library.cornell.edu/usda-esmis/files/c821gj76b/f7623m42k/qf85nk40w/agpr0319.txt',
 'https://downloads.usda.library.cornell.edu/usda-esmis/files/c821gj76b/7w62fg32b/n009w815n/agpr0219.txt',
 'https://downloads.usda.library.cornell.edu/usda-esmis/files/c821gj76b/kk91fs55d/z890s0860/agpr0219.txt',
 'https://downloads.usda.library.cornell.edu/usda-esmis/files/c821gj76b/t435gj88z/8910k0903/agpr0119.txt',
 'https://downloads.usda.library.cornell.edu/usda-esmis/files/c821gj76b/m613n410w/41687p68x/01-30-19_Report_Reschedule_ASB_Notice_Final.txt',
 'https://downloads.usda.library.cornell.edu/usda-esmis/files/c821gj76b/st74cv012/0z709086s/agpr1118.txt']
Page no.2...
--------------------------------------------------------------------------------
['https://downloads.usda.library.cornell.edu/usda-esmis/files/c821gj76b/5q47rs05x/m900nx65x/agpr1018.txt',
 'https://downloads.usda.library.cornell.edu/usda-esmis/files/c821gj76b/4b29b953w/m900nx64n/agpr0918.txt',
 'https://downloads.usda.library.cornell.edu/usda-esmis/files/c821gj76b/5h73px257/1c18dh137/AgriPric-08-29-2018.txt',
 'https://downloads.usda.library.cornell.edu/usda-esmis/files/c821gj76b/t722hb16b/76537257b/AgriPric-07-30-2018.txt',
 'https://downloads.usda.library.cornell.edu/usda-esmis/files/c821gj76b/pz50gx32d/qb98mg88k/AgriPric-06-28-2018.txt',
 'https://downloads.usda.library.cornell.edu/usda-esmis/files/c821gj76b/vd66w115f/p2676w80r/AgriPric-05-30-2018.txt',
 'https://downloads.usda.library.cornell.edu/usda-esmis/files/c821gj76b/9c67wp20r/bc386k622/AgriPric-04-27-2018.txt',
 'https://downloads.usda.library.cornell.edu/usda-esmis/files/c821gj76b/r494vm201/h128ng14d/AgriPric-03-28-2018.txt',
 'https://downloads.usda.library.cornell.edu/usda-esmis/files/c821gj76b/z316q273n/37720f04c/AgriPric-02-27-2018_correction.txt',
 'https://downloads.usda.library.cornell.edu/usda-esmis/files/c821gj76b/5d86p1433/zp38wd92f/AgriPric-01-30-2018.txt']

...and so on.
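
The script builds the next-page address by concatenating base_url with the href of the "next" link, which works as long as that href is a root-relative path. A slightly more defensive variant, sketched below, resolves the href with urljoin from the standard library so that an absolute href would also be handled. The helper name next_page_url and the sample href values are illustrative assumptions, not taken from the site.

```python
from urllib.parse import urljoin

# Site root from the script above.
base_url = 'https://usda.library.cornell.edu'


def next_page_url(href, base=base_url):
    """Resolve a pagination href (relative or absolute) against the site root.

    Hypothetical helper for illustration; the original script uses
    plain string concatenation instead.
    """
    return urljoin(base, href)


if __name__ == '__main__':
    # Illustrative href values (assumed shape of the site's "next" links).
    print(next_page_url('/concern/publications/c821gj76b?locale=en&page=2'))
    print(next_page_url('https://usda.library.cornell.edu/concern/publications/c821gj76b?page=3'))
```

Inside the loop, this would replace `base_url + m['href']` with `next_page_url(m['href'])`.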

You can then download a text file from any of these links, for example:

txt_data = requests.get('https://downloads.usda.library.cornell.edu/usda-esmis/files/c821gj76b/kd17d5288/ms35tm800/agpr0719.txt').text
print(txt_data)

Prints (you could save it to a file instead of printing it to the screen):

Agricultural Prices

ISSN: 1937-4216

Released July 31, 2019, by the National Agricultural Statistics Service 
(NASS), Agricultural Statistics Board, United States Department of 
Agriculture (USDA).

June Prices Received Index Up 1.0 Percent 

...etc.
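
Saving to a file instead of printing could look roughly like the sketch below. The function names, the downloads directory, and the naming scheme are assumptions made for illustration. The file name is built from the last two path segments because two links on page 1 both end in agpr0219.txt, so using the bare file name would let one download overwrite the other.

```python
import os
from urllib.parse import urlparse


def filename_from_url(url):
    """Build a local file name from the last two path segments of the URL.

    Hypothetical naming scheme: '<parent-id>_<file>.txt', chosen so that
    files with the same base name in different folders do not collide.
    """
    parts = urlparse(url).path.strip('/').split('/')
    return '_'.join(parts[-2:])


def save_text(text, directory, filename):
    """Write text to directory/filename as UTF-8 and return the full path."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, filename)
    with open(path, 'w', encoding='utf-8') as f:
        f.write(text)
    return path


def download_txt(url, directory='downloads'):
    """Fetch one .txt URL with requests and save it under `directory`."""
    # Imported here so the two helpers above work without requests installed.
    import requests

    response = requests.get(url)
    response.raise_for_status()  # fail loudly on HTTP errors
    return save_text(response.text, directory, filename_from_url(url))
```

With the list from the scraping loop, `for u in txt_urls: download_txt(u)` would write one file per link into the downloads directory.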