Here is my code. It only scrapes one page, but I have 11,000 of them. The only difference between them is the ID.
https://www.rlsnet.ru/mkb_index_id_1.htm
https://www.rlsnet.ru/mkb_index_id_2.htm
https://www.rlsnet.ru/mkb_index_id_3.htm
....
https://www.rlsnet.ru/mkb_index_id_11000.htm
How can I loop my code to scrape all 11,000 pages? Is that even feasible for so many pages? I could put the URLs into a list and then scrape them, but with 11,000 of them that is a very long list to write out by hand.
import requests
from pandas import DataFrame
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
page_sc = requests.get('https://www.rlsnet.ru/mkb_index_id_1.htm')
soup_sc = BeautifulSoup(page_sc.content, 'html.parser')
items_sc = soup_sc.find_all(class_='subcatlist__item')
mkb_names_sc = [item_sc.find(class_='subcatlist__link').get_text() for item_sc in items_sc]
mkb_stuff_sce = pd.DataFrame({
    'first': mkb_names_sc,
})
mkb_stuff_sce.to_csv('/Users/gfidarov/Desktop/Python/MKB/mkb.csv')
Answer 0 (score: 0)
You can build the URL string on the fly like this. You may also want to add a timed delay on each iteration of the loop so the server doesn't block you.
import requests
import pandas as pd
from bs4 import BeautifulSoup

path_of_csv = '/Users/gfidarov/Desktop/Python/MKB/mkb.csv'
first_string = 'https://www.rlsnet.ru/mkb_index_id_'
third_string = '.htm'
df = pd.DataFrame(columns=['scraping results'])

try:
    for second_string in range(1, 11001):
        second_string = str(second_string)
        # Build the URL for the current ID on the fly
        url = first_string + second_string + third_string
        page_sc = requests.get(url)
        soup_sc = BeautifulSoup(page_sc.content, 'html.parser')
        items_sc = soup_sc.find_all(class_='subcatlist__item')
        mkb_names_sc = [item_sc.find(class_='subcatlist__link').get_text() for item_sc in items_sc]
        # DataFrame.append returns a new DataFrame, so reassign it
        # (append is deprecated in newer pandas; pd.concat is its replacement)
        df = df.append({'scraping results': mkb_names_sc}, ignore_index=True)
        # Write the results collected so far, so a crash loses at most one page
        df.to_csv(path_or_buf=path_of_csv)
except Exception:
    # If it fails in the middle of the process, the results won't be lost
    backup_csv = path_of_csv + '.backup'
    df.to_csv(path_or_buf=backup_csv)
    print('Failed at index ' + second_string + '. Please start from here again by setting the '
          'beginning of the range to this index. A backup was made of the results that were '
          'already scraped. You may want to rename the backup to avoid overwriting it in the next run.')
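The timed delay mentioned above is not actually in the code, so here is a minimal sketch of one way to add it. The pause length is an assumption on my part (rlsnet.ru does not publish a rate limit), so adjust it as needed.

import time
import random

import requests

for page_id in range(1, 11001):
    url = 'https://www.rlsnet.ru/mkb_index_id_{}.htm'.format(page_id)
    page_sc = requests.get(url)
    # ... parse page_sc exactly as in the loop above ...
    # Sleep 0.5-1.5 seconds between requests; this interval is a guess at a
    # polite rate, not a limit published by the site.
    time.sleep(random.uniform(0.5, 1.5))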
Answer 1 (score: 0)
My approach is simple: I just loop over the code above.
all_names = []  # collect the names from every page here
for i in range(1, 11001):
    page_sc = requests.get('https://www.rlsnet.ru/mkb_index_id_{}.htm'.format(i))
    soup_sc = BeautifulSoup(page_sc.content, 'html.parser')
    items_sc = soup_sc.find_all(class_='subcatlist__item')
    mkb_names_sc = [item_sc.find(class_='subcatlist__link').get_text() for item_sc in items_sc]
    all_names.extend(mkb_names_sc)

# Write everything once at the end; writing inside the loop would overwrite
# the file with only the current page's results on each iteration.
mkb_stuff_sce = pd.DataFrame({
    'first': all_names,
})
mkb_stuff_sce.to_csv('/Users/gfidarov/Desktop/Python/MKB/mkb.csv')
What I did was wrap the code in a for loop: the range() function generates the sequence of indices, and the format() method places each index into the URL. This should work like a charm. Hope this helps :)
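One caveat worth adding: it is not guaranteed that every ID between 1 and 11000 resolves to a real page. Below is a minimal sketch of skipping missing pages; the assumption that a missing ID comes back with a non-200 status code is mine, not something stated in the question, so check how the site actually responds.

import requests
from bs4 import BeautifulSoup

def scrape_page(i):
    """Return the subcategory names on page i, or an empty list if the page is missing."""
    page_sc = requests.get('https://www.rlsnet.ru/mkb_index_id_{}.htm'.format(i))
    if page_sc.status_code != 200:
        # Assumption: a missing ID returns an error status; adjust this check
        # if the site responds with a redirect or an empty page instead.
        return []
    soup_sc = BeautifulSoup(page_sc.content, 'html.parser')
    items_sc = soup_sc.find_all(class_='subcatlist__item')
    return [item_sc.find(class_='subcatlist__link').get_text() for item_sc in items_sc]

Calling scrape_page(i) inside the for loop keeps the loop body short and makes it easy to retry individual IDs later.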