Python web scraping with BeautifulSoup: how do I loop over complex URLs?

Date: 2016-03-21 02:02:31

Tags: python web-scraping beautifulsoup

So I'm trying to scrape the Florida Statutes from this site: www.leg.state.fl.us/Statutes/

So far I've only been able to scrape the first chapter: http://www.leg.state.fl.us/Statutes/index.cfm?App_mode=Display_Statute&URL=0000-0099/0001/0001.html

I noticed that the URL changes to "URL=0000-0099/0002/0002.html" when I jump to the next chapter. My question is: how do I write this so that it scrapes all of the chapters? (The first part of the URL, 0000-0099, is the range of chapters, so in this case it covers Chapter 1 through Chapter 99.)

My code is below:

from bs4 import BeautifulSoup
import urllib2

# Output file for the scraped statute text (raw string avoids backslash escapes)
f = open(r'C:\Python27\projects\outflieFS_final.txt', 'w')

def First_part(url):
    # Download the page and parse it into a BeautifulSoup tree
    thepage = urllib2.urlopen(url)
    soupdata = BeautifulSoup(thepage, 'html.parser')
    return soupdata

soup = First_part("http://www.leg.state.fl.us/statutes/index.cfm?App_mode=Display_Statute&URL=0000-0099/0001/0001.html")

# The statute text lives inside the div with id="statutes"
tableContents = soup.find('div', {'id': 'statutes'})

for data in tableContents.findAll('div'):
    # Extract each inner div's text and write it out, separated by blank lines
    data = data.text.encode("utf-8", "ignore")
    data = str(data) + "\n\n"
    f.write(data)
f.close()

1 Answer:

Answer 0 (score: 0):

Make a loop and use string formatting to build the URLs:

base_url = "http://www.leg.state.fl.us/Statutes/index.cfm?App_mode=Display_Statute&URL=0000-0099/00{chapter:02d}/00{chapter:02d}.html"
for chapter in range(1, 100):
    url = base_url.format(chapter=chapter)
    print(url)
    # make a request and parse the page

This produces the following URLs:

http://www.leg.state.fl.us/Statutes/index.cfm?App_mode=Display_Statute&URL=0000-0099/0001/0001.html
http://www.leg.state.fl.us/Statutes/index.cfm?App_mode=Display_Statute&URL=0000-0099/0002/0002.html
...
http://www.leg.state.fl.us/Statutes/index.cfm?App_mode=Display_Statute&URL=0000-0099/0098/0098.html
http://www.leg.state.fl.us/Statutes/index.cfm?App_mode=Display_Statute&URL=0000-0099/0099/0099.html
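
If it helps, here is a minimal sketch that plugs the loop into the First_part function and output file from the question, so each generated chapter URL is fetched and its statutes div is written out. It keeps Python 2 / urllib2 as in the question; the skipping of chapter numbers that return an error or have no statutes div is an assumption on my part, marked in the comments.

from bs4 import BeautifulSoup
import urllib2

base_url = ("http://www.leg.state.fl.us/Statutes/index.cfm?"
            "App_mode=Display_Statute&URL=0000-0099/00{chapter:02d}/00{chapter:02d}.html")

def First_part(url):
    # Download the page and parse it into a BeautifulSoup tree (same as in the question)
    thepage = urllib2.urlopen(url)
    return BeautifulSoup(thepage, 'html.parser')

f = open(r'C:\Python27\projects\outflieFS_final.txt', 'w')  # output path from the question

for chapter in range(1, 100):
    url = base_url.format(chapter=chapter)
    try:
        soup = First_part(url)
    except urllib2.URLError:
        # Assumption: a chapter number whose page cannot be fetched is simply skipped
        continue
    tableContents = soup.find('div', {'id': 'statutes'})
    if tableContents is None:
        continue  # assumption: no statutes div on this page, skip it
    for data in tableContents.findAll('div'):
        f.write(data.text.encode("utf-8", "ignore") + "\n\n")
f.close()

Writing everything to one file handle opened before the loop keeps all chapters in a single output file; open the file inside the loop with a per-chapter name instead if you want one file per chapter.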