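I am trying to scrape the Florida Statutes from this website: www.leg.state.fl.us/Statutes/
So far I have only been able to grab the first chapter: http://www.leg.state.fl.us/Statutes/index.cfm?App_mode=Display_Statute&URL=0000-0099/0001/0001.html
I noticed that the URL changes to "URL=0000-0099/0002/0002.html" when I move to the next chapter. My question is: how can I write my code so that it scrapes all of the chapters? (The first part of the URL, 0000-0099, is the range of chapters, so in this case it covers chapters 1 through 99.)
My code is below: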
from bs4 import BeautifulSoup
import urllib2

# Raw string so the backslashes in the Windows path are taken literally
f = open(r'C:\Python27\projects\outflieFS_final.txt', 'w')

def First_part(url):
    # Download the page and parse it into a BeautifulSoup tree
    thepage = urllib2.urlopen(url)
    soupdata = BeautifulSoup(thepage, 'html.parser')
    return soupdata

soup = First_part("http://www.leg.state.fl.us/statutes/index.cfm?App_mode=Display_Statute&URL=0000-0099/0001/0001.html")
# Find the <div id="statutes"> container
tableContents = soup.find('div', {'id': 'statutes'})

# Write the text of every nested <div> to the output file
for data in tableContents.findAll('div'):
    data = data.text.encode("utf-8", "ignore")
    data = str(data) + "\n\n"
    f.write(data)

f.close()
Answer 0 (score: 0)
Make a loop and use string formatting to build the URLs:
base_url = "http://www.leg.state.fl.us/Statutes/index.cfm?App_mode=Display_Statute&URL=0000-0099/00{chapter:02d}/00{chapter:02d}.html"

for chapter in range(1, 100):
    url = base_url.format(chapter=chapter)
    print(url)
    # make a request and parse the page
This produces the following URLs:
http://www.leg.state.fl.us/Statutes/index.cfm?App_mode=Display_Statute&URL=0000-0099/0001/0001.html
http://www.leg.state.fl.us/Statutes/index.cfm?App_mode=Display_Statute&URL=0000-0099/0002/0002.html
...
http://www.leg.state.fl.us/Statutes/index.cfm?App_mode=Display_Statute&URL=0000-0099/0098/0098.html
http://www.leg.state.fl.us/Statutes/index.cfm?App_mode=Display_Statute&URL=0000-0099/0099/0099.html
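The loop above only covers the 0000-0099 range. As a rough sketch of how it might be extended to higher chapter numbers, the range prefix can be computed from the chapter number instead of being hard-coded. This assumes that every range is a block of 100 chapters (0100-0199, 0200-0299, and so on), which is only inferred from the single range mentioned in the question and has not been checked against the site. The names `chapter_url` and `chapter_urls` are hypothetical helpers, not part of any library.

```python
BASE_URL = (
    "http://www.leg.state.fl.us/Statutes/index.cfm"
    "?App_mode=Display_Statute"
    "&URL={start:04d}-{end:04d}/{chapter:04d}/{chapter:04d}.html"
)


def chapter_url(chapter):
    """Build the URL for a single chapter number.

    Assumption: chapters are grouped in blocks of 100
    (0000-0099, 0100-0199, ...), as suggested by the question.
    """
    start = (chapter // 100) * 100  # first chapter number of the block
    end = start + 99                # last chapter number of the block
    return BASE_URL.format(start=start, end=end, chapter=chapter)


def chapter_urls(first, last):
    """Yield the URLs for chapters first..last, inclusive."""
    for chapter in range(first, last + 1):
        yield chapter_url(chapter)


if __name__ == "__main__":
    # Print a few URLs on either side of a block boundary
    for url in chapter_urls(98, 101):
        print(url)
```

Each generated URL could then be passed to a fetch-and-parse function such as `First_part` from the question. Not every chapter number necessarily exists on the site, so the request for a given URL may fail and would need to be handled.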