I am trying, in Python, to scrape the set of tables that appear at each link of a web URL and to save each link's table as a CSV. The code is below:
import csv
import urllib2
from bs4 import BeautifulSoup

first = urllib2.urlopen("http://www.admision.unmsm.edu.pe/res20130914/A.html").read()
soup = BeautifulSoup(first)

# collect the href of every link that sits inside a table row
w = []
for q in soup.find_all('tr'):
    for link in q.find_all('a'):
        w.append(link["href"])

# remove the first '.' from each relative href (not used later; l below holds the same values)
s = [i.replace(".", "", 1) for i in w]

l = []
for t in w:
    l.append(t.replace(".", "", 1))
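Just to illustrate what the replace(".", "", 1) step does to a collected href (the exact raw href format is an assumption; the value below is inferred from the error further down):

href = "./A/011/0.html"            # assumed raw href from the index page
print href.replace(".", "", 1)     # -> '/A/011/0.html', the form that record() receives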
def record(part):
    # fetch the index page for this link, e.g. part = '/A/011/0.html'
    url = "http://www.admision.unmsm.edu.pe/res20130914{}".format(part)
    u = urllib2.urlopen(url)
    try:
        html = u.read()
    finally:
        u.close()
    soup = BeautifulSoup(html)

    # count the links inside <center>; half of that count gives the number of sub-pages
    c = []
    for n in soup.find_all('center'):
        for b in n.find_all('a')[2:]:
            c.append(b.text)
    t = len(c) / 2

    # drop the trailing '0.html' so only the directory part of the path remains
    part = part[:-6]
    with open('{}.csv'.format(part), 'wb') as f:
        writer = csv.writer(f)
        for i in range(t):
            url = "http://www.admision.unmsm.edu.pe/res20130914{}{}.html".format(part, i)
            u = urllib2.urlopen(url)
            try:
                html = u.read()
            finally:
                u.close()
            soup = BeautifulSoup(html)
            # write the first six cells of every data row to the CSV
            for tr in soup.find_all('tr')[1:]:
                tds = tr.find_all('td')
                row = [elem.text.encode('utf-8') for elem in tds[:6]]
                writer.writerow(row)
Then, using the function I created, I try to scrape the tables and create a CSV for each link. The code is below:
for n in l:
    record(n)
Unfortunately, the result is an error:
IOError Traceback (most recent call last)
<ipython-input-44-da894016f419> in <module>()
60
61 for n in l:
---> 62 record(n)
63
64
<ipython-input-44-da894016f419> in record(part)
43
44
---> 45 with open('{}.csv'.format(part), 'wb') as f:
46 writer = csv.writer(f)
47 for i in range(t):
IOError: [Errno 2] No such file or directory: '/A/011/.csv'
EDIT:
I just found out what was really going on, and I came up with a solution.
The problem is that when I run record('/A/012/0.html'), my function also uses '/A/012/0.html' as the name of the file. However, Python interprets the '/' as an existing directory.
So I made a small change:
part = part[:-6]
# below is the line where I made the small change
name = part.replace("/", "")
with open('{}.csv'.format(name), 'wb') as f:
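With that change, the '/' characters are stripped from the path before it is used as a file name. A minimal sketch of what this does to an example path (the value '/A/011/0.html' is assumed from the traceback above):

part = '/A/011/0.html'[:-6]        # '/A/011/'
name = part.replace("/", "")       # 'A011'
# open('A011.csv', 'wb') now creates the file in the current working directory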
I removed the '/' character and the web scraping works fine. I would like to know if anyone has a suggestion for using the '/' character in the name of the CSV file.
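If keeping the '/' (that is, the original directory structure) is desirable, one option would be to create the matching sub-directories first and write the CSV inside them. A minimal sketch, assuming part still looks like '/A/011/' at this point, that writing under the current working directory is acceptable, and with "results.csv" as a placeholder file name:

import os
import csv

rel_dir = part.lstrip("/")                        # e.g. 'A/011/' (leading '/' removed)
if rel_dir and not os.path.isdir(rel_dir):
    os.makedirs(rel_dir)                          # create the sub-directories first
csv_path = os.path.join(rel_dir, "results.csv")   # placeholder name inside that directory
with open(csv_path, 'wb') as f:
    writer = csv.writer(f)
    # ... writer.writerow(row) as in record() ...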