Unable to loop through multiple pages to scrape data

Time: 2016-07-07 16:02:05

Tags: python html web-scraping beautifulsoup

I need to move on to the next URL link (each page has about 20 rows I need to extract, and those rows then need to be appended to the results from the following URL).

There are about 360 URLs, and I would like to extract data by iterating through all of them. My code is below. I want to write the results to a CSV file later. Any suggestions would be much appreciated, as I am new to Python.

    from urlparse import urljoin
    import requests
    from bs4 import BeautifulSoup
    import csv

    base_url = 'http://cricket.inhs.uiuc.edu/edwipweb/FMPro?-db=nvpassoc.fp5&-format=nvp_search_results.htm&-lay=web%20form&-max=20&-findall='
    list_of_rows = []

    next_page = 'http://cricket.inhs.uiuc.edu/edwipweb/FMPro?-db=nvpassoc.fp5&-format=nvp_search_results.htm&-lay=web%20form&-max=20&-skip=20&-findall='

    while True:
        soup = BeautifulSoup(requests.get(next_page).content)
        for row in soup.findAll('table')[1].findAll('tr'):
            list_of_cells = []
            for cell in row.findAll('p'):
                text = cell.text.replace(' ', '')
                list_of_cells.append(text)
            list_of_rows.append(list_of_cells)

        try:
            next_page = urljoin(base_url, soup.select('/FMPro?-db=nvpassoc.fp5&-format=nvp_search_results.htm&-lay=web%20form&-max=20&-skip=20&-findall=')[1].get('href'))
        except IndexError:
            break

    print list_of_rows

    outfile = open("./trialpage.csv", "wb")
    writer = csv.writer(outfile)
    writer.writerows(list_of_rows)

1 Answer:

Answer 0 (score: 0)

I made a few changes to your code. I set up the original URL with a variable called skip; skip is increased by 20 on each pass.

You can take bigger chunks, since you are not limited by the screen view, and I think it will be faster. Try max=200 and then step by 200.
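A minimal sketch of that approach, reconstructed from the description above: a skip counter fills in the -skip parameter, -max is raised to 200, and the loop stops once no results table comes back (the stopping condition and starting at -skip=0 are assumptions, not spelled out in the answer).

    import csv

    import requests
    from bs4 import BeautifulSoup

    # URL template with -max raised to 200; the -skip value is filled in each pass
    # (assumption: the server accepts -skip=0 for the first chunk of results).
    url_template = ('http://cricket.inhs.uiuc.edu/edwipweb/FMPro?-db=nvpassoc.fp5'
                    '&-format=nvp_search_results.htm&-lay=web%20form'
                    '&-max=200&-skip=%d&-findall=')

    list_of_rows = []
    skip = 0

    while True:
        soup = BeautifulSoup(requests.get(url_template % skip).content)

        try:
            rows = soup.findAll('table')[1].findAll('tr')
        except IndexError:
            # no results table on this page: assume we have run past the last record
            break
        if not rows:
            break

        for row in rows:
            list_of_cells = []
            for cell in row.findAll('p'):
                # trim surrounding whitespace (adjust the cleanup as in the question if needed)
                list_of_cells.append(cell.text.strip())
            list_of_rows.append(list_of_cells)

        skip += 200  # advance to the next chunk of 200 records

    outfile = open("./trialpage.csv", "wb")
    writer = csv.writer(outfile)
    writer.writerows(list_of_rows)
    outfile.close()

Building the page URL from a counter avoids having to find a "next" link in the HTML at all, which is where the original select() call was going wrong.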