Python web scraping with BeautifulSoup: how do I loop over complex URLs?

Date: 2016-03-21 02:02:31

Tags: python web-scraping beautifulsoup

So I'm trying to scrape the Florida Statutes from this site: www.leg.state.fl.us/Statutes/

So far I've only been able to scrape the first chapter: http://www.leg.state.fl.us/Statutes/index.cfm?App_mode=Display_Statute&URL=0000-0099/0001/0001.html

I noticed that the URL changes to "URL=0000-0099/0002/0002.html" when I jump to the next chapter. My question is: how do I write this so that it scrapes all of the chapters? (The first part of the URL, 0000-0099, is the range of chapters, so in this case it covers Chapter 1 through Chapter 99.)

My code is below:

from bs4 import BeautifulSoup
import urllib2

# Output file for the scraped statute text (raw string avoids backslash escapes)
f = open(r'C:\Python27\projects\outflieFS_final.txt', 'w')

def First_part(url):
    # Download the page and parse it into a BeautifulSoup tree
    thepage = urllib2.urlopen(url)
    soupdata = BeautifulSoup(thepage, 'html.parser')
    return soupdata

soup = First_part("http://www.leg.state.fl.us/statutes/index.cfm?App_mode=Display_Statute&URL=0000-0099/0001/0001.html")

# The statute text lives inside the div with id="statutes"
tableContents = soup.find('div', {'id': 'statutes'})

for data in tableContents.findAll('div'):
    # Extract each inner div's text and write it out, separated by blank lines
    data = data.text.encode("utf-8", "ignore")
    data = str(data) + "\n\n"
    f.write(data)
f.close()

1 Answer:

Answer 0 (score: 0):

Make a loop and use string formatting to build the URLs:

base_url = "http://www.leg.state.fl.us/Statutes/index.cfm?App_mode=Display_Statute&URL=0000-0099/00{chapter:02d}/00{chapter:02d}.html"
for chapter in range(1, 100):
    url = base_url.format(chapter=chapter)
    print(url)
    # make a request and parse the page

This produces the following URLs:

http://www.leg.state.fl.us/Statutes/index.cfm?App_mode=Display_Statute&URL=0000-0099/0001/0001.html
http://www.leg.state.fl.us/Statutes/index.cfm?App_mode=Display_Statute&URL=0000-0099/0002/0002.html
...
http://www.leg.state.fl.us/Statutes/index.cfm?App_mode=Display_Statute&URL=0000-0099/0098/0098.html
http://www.leg.state.fl.us/Statutes/index.cfm?App_mode=Display_Statute&URL=0000-0099/0099/0099.html
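
If it helps, here is a minimal sketch that plugs the loop into the First_part function and output file from the question, so each generated chapter URL is fetched and its statutes div is written out. It keeps Python 2 / urllib2 as in the question; the skipping of chapter numbers that return an error or have no statutes div is an assumption on my part, marked in the comments.

from bs4 import BeautifulSoup
import urllib2

base_url = ("http://www.leg.state.fl.us/Statutes/index.cfm?"
            "App_mode=Display_Statute&URL=0000-0099/00{chapter:02d}/00{chapter:02d}.html")

def First_part(url):
    # Download the page and parse it into a BeautifulSoup tree (same as in the question)
    thepage = urllib2.urlopen(url)
    return BeautifulSoup(thepage, 'html.parser')

f = open(r'C:\Python27\projects\outflieFS_final.txt', 'w')  # output path from the question

for chapter in range(1, 100):
    url = base_url.format(chapter=chapter)
    try:
        soup = First_part(url)
    except urllib2.URLError:
        # Assumption: a chapter number whose page cannot be fetched is simply skipped
        continue
    tableContents = soup.find('div', {'id': 'statutes'})
    if tableContents is None:
        continue  # assumption: no statutes div on this page, skip it
    for data in tableContents.findAll('div'):
        f.write(data.text.encode("utf-8", "ignore") + "\n\n")
f.close()

Writing everything to one file handle opened before the loop keeps all chapters in a single output file; open the file inside the loop with a per-chapter name instead if you want one file per chapter.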