I need to move on to the next URL link (each page has roughly 20 rows I need to extract, and those rows then need to be appended to the results from the following URL).
There are about 360 URLs in total, and I would like to loop through all of them and extract the data. My code is below. I want to write the results to a CSV file later. Any suggestions would be much appreciated, as I am new to Python.
from urlparse import urljoin
import requests
from bs4 import BeautifulSoup
import csv
base_url = 'http://cricket.inhs.uiuc.edu/edwipweb/FMPro?-db=nvpassoc.fp5&-format=nvp_search_results.htm&-lay=web%20form&-max=20&-findall='
list_of_rows = []
next_page = 'http://cricket.inhs.uiuc.edu/edwipweb/FMPro?-db=nvpassoc.fp5&-format=nvp_search_results.htm&-lay=web%20form&-max=20&-skip=20&-findall='
while True:
    soup = BeautifulSoup(requests.get(next_page).content)

    # the second table on the page holds the ~20 result rows
    for row in soup.findAll('table')[1].findAll('tr'):
        list_of_cells = []
        for cell in row.findAll('p'):
            text = cell.text.replace(' ', '')
            list_of_cells.append(text)
        list_of_rows.append(list_of_cells)

    # try to move on to the next page; break out when no link is found
    try:
        next_page = urljoin(base_url, soup.select('/FMPro?-db=nvpassoc.fp5&-format=nvp_search_results.htm&-lay=web%20form&-max=20&-skip=20&-findall=')[1].get('href'))
    except IndexError:
        break

print list_of_rows

outfile = open("./trialpage.csv", "wb")
writer = csv.writer(outfile)
writer.writerows(list_of_rows)
Answer 0 (score: 0)
I made a few changes to your code. I set up the original URL with a variable called skip; each time through the loop, skip is increased by 20.
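A minimal sketch of that approach is below. It assumes the FileMaker endpoint keeps accepting a growing -skip value (including -skip=0 for the first page) and that the result rows still sit in the second table, as in your original code; the stopping condition on a missing or empty results table is my assumption.

import csv

import requests
from bs4 import BeautifulSoup

# The -skip value is substituted into the URL on every pass (assumed URL shape).
base_url = ('http://cricket.inhs.uiuc.edu/edwipweb/FMPro?-db=nvpassoc.fp5'
            '&-format=nvp_search_results.htm&-lay=web%20form'
            '&-max=20&-skip=%d&-findall=')

list_of_rows = []
skip = 0

while True:
    soup = BeautifulSoup(requests.get(base_url % skip).content)
    try:
        rows = soup.findAll('table')[1].findAll('tr')
    except IndexError:        # a page past the last record has no results table
        break
    if not rows:              # or the table is there but empty
        break
    for row in rows:
        list_of_cells = [cell.text.replace(' ', '') for cell in row.findAll('p')]
        list_of_rows.append(list_of_cells)
    skip += 20                # advance to the next block of 20 records

outfile = open("./trialpage.csv", "wb")
writer = csv.writer(outfile)
writer.writerows(list_of_rows)
outfile.close()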
You can grab bigger chunks, since you are not limited by the screen view, and I think it will be faster. Try max=200, and then step through in increments of 200.
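If the server does honour a larger -max value (an assumption worth testing), the chunk size in the sketch above becomes a single constant to tune:

CHUNK = 200   # hypothetical larger page size; fall back to 20 if the server rejects it

base_url = ('http://cricket.inhs.uiuc.edu/edwipweb/FMPro?-db=nvpassoc.fp5'
            '&-format=nvp_search_results.htm&-lay=web%20form'
            '&-max=%d&-skip=%d&-findall=')

# inside the while loop of the sketch above:
#     soup = BeautifulSoup(requests.get(base_url % (CHUNK, skip)).content)
#     ...
#     skip += CHUNK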