How do I loop my scraping code over many pages?

Date: 2019-11-28 10:03:16

Tags: python pandas web web-scraping beautifulsoup

Here is my code. It scrapes only one page, but I have 11,000 of them; the only difference between the pages is their ID:

https://www.rlsnet.ru/mkb_index_id_1.htm
https://www.rlsnet.ru/mkb_index_id_2.htm
https://www.rlsnet.ru/mkb_index_id_3.htm
....
https://www.rlsnet.ru/mkb_index_id_11000.htm

How can I loop my code so it scrapes all 11,000 pages? Is that even feasible with so many pages? I could put the URLs into a list and scrape them from there, but with 11,000 of them that seems like a long way to go.

import requests
from pandas import DataFrame
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup

page_sc = requests.get('https://www.rlsnet.ru/mkb_index_id_1.htm')
soup_sc = BeautifulSoup(page_sc.content, 'html.parser')
items_sc = soup_sc.find_all(class_='subcatlist__item')
mkb_names_sc = [item_sc.find(class_='subcatlist__link').get_text() for item_sc in items_sc]
mkb_stuff_sce = pd.DataFrame(
    {
        'first': mkb_names_sc,
    })
mkb_stuff_sce.to_csv('/Users/gfidarov/Desktop/Python/MKB/mkb.csv')
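
For reference, the URL list mentioned in the question can be generated in one line; a minimal sketch (the variable names here are illustrative, not taken from the code above):

base_url = 'https://www.rlsnet.ru/mkb_index_id_{}.htm'
urls = [base_url.format(page_id) for page_id in range(1, 11001)]  # 11,000 URL strings

print(len(urls))   # 11000
print(urls[0])     # https://www.rlsnet.ru/mkb_index_id_1.htm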

2 answers:

Answer 0 (score: 0)

You can build the URL string on the fly like this. You may also want to add a timed delay on each iteration of the loop so the server does not block you.

import os

import requests
import pandas as pd
from bs4 import BeautifulSoup


path_of_csv = '/Users/gfidarov/Desktop/Python/MKB/mkb.csv'

first_string = 'https://www.rlsnet.ru/mkb_index_id_'
third_string = '.htm'

df = pd.DataFrame(columns=['scraping results'])

try:
    for second_string in range(1, 11001):
        second_string = str(second_string)
        url = first_string + second_string + third_string
        page_sc = requests.get(url)
        soup_sc = BeautifulSoup(page_sc.content, 'html.parser')
        items_sc = soup_sc.find_all(class_='subcatlist__item')
        mkb_names_sc = [item_sc.find(class_='subcatlist__link').get_text() for item_sc in items_sc]
        # DataFrame.append returns a new frame, so reassign or the row is lost
        df = df.append({'scraping results': mkb_names_sc}, ignore_index=True)

    df.to_csv(path_or_buf=path_of_csv)

except Exception:
    # If it fails in the middle of the process, the results won't be lost
    backup_csv = os.path.join(
        os.path.dirname(path_of_csv), 'backup_' + os.path.basename(path_of_csv)
    )
    df.to_csv(path_or_buf=backup_csv)
    print('Failed at index ' + second_string + '. Please start again from this index by '
          'setting the beginning of the range to it. A backup was made of the results '
          'that were already scraped; you may want to rename it to avoid overwriting it '
          'in the next run.')
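
The timed delay suggested above is not in the snippet itself. A minimal sketch of one way to add it, assuming a fixed one-second pause is acceptable (the helper name and the pause length are assumptions, not part of the answer):

import time

import requests


def fetch_page(url, pause_seconds=1.0):
    # Fetch one page, then pause so the server is not hit too quickly.
    # pause_seconds is an assumed value; tune it to the server's tolerance.
    response = requests.get(url)
    response.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page
    time.sleep(pause_seconds)
    return response.content

Inside the loop, page_sc.content would then come from fetch_page(url) instead of a bare requests.get(url).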

Answer 1 (score: 0)

My approach is simple: I just loop over the code above.

all_names = []

for i in range(1, 11001):
    page_sc = requests.get('https://www.rlsnet.ru/mkb_index_id_{}.htm'.format(i))
    soup_sc = BeautifulSoup(page_sc.content, 'html.parser')
    items_sc = soup_sc.find_all(class_='subcatlist__item')
    mkb_names_sc = [item_sc.find(class_='subcatlist__link').get_text() for item_sc in items_sc]
    all_names.extend(mkb_names_sc)

# Write once after the loop; writing inside the loop would overwrite the file on every page.
mkb_stuff_sce = pd.DataFrame({'first': all_names})
mkb_stuff_sce.to_csv('/Users/gfidarov/Desktop/Python/MKB/mkb.csv')

What I did was wrap the code in a for loop: range() generates the sequence of indexes, and format() places each index into the URL.
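
To make the range()/format() step concrete, a tiny illustration (only the first few of the 11,000 IDs are shown):

template = 'https://www.rlsnet.ru/mkb_index_id_{}.htm'
for i in range(1, 4):  # range(1, 11001) in the real run
    print(template.format(i))
# https://www.rlsnet.ru/mkb_index_id_1.htm
# https://www.rlsnet.ru/mkb_index_id_2.htm
# https://www.rlsnet.ru/mkb_index_id_3.htm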

This should work like a charm. Hope this helps :)