I'm trying to create a CSV for every table shown in each of the links. This is the link.
That page contains 36 links, so 36 CSVs should be generated. When I run my code, 36 CSVs are created, but they are all empty. My code is below:
import csv
import urllib2
from bs4 import BeautifulSoup

first = urllib2.urlopen("http://www.admision.unmsm.edu.pe/admisionsabado/A.html").read()
soup = BeautifulSoup(first)

w = []
for q in soup.find_all('tr'):
    for link in q.find_all('a'):
        w.append(link["href"])

l = []
for t in w:
    l.append(t.replace(".", "", 1))

def record(part):
    url = "http://www.admision.unmsm.edu.pe/admisionsabado".format(part)
    u = urllib2.urlopen(url)
    try:
        html = u.read()
    finally:
        u.close()
    soup = BeautifulSoup(html)
    c = []
    for n in soup.find_all('center'):
        for b in n.find_all('a')[2:]:
            c.append(b.text)
    t = (len(c)) / 2
    part = part[:-6]
    name = part.replace("/", "")
    with open('{}.csv'.format(name), 'wb') as f:
        writer = csv.writer(f)
        for i in range(t):
            url = "http://www.admision.unmsm.edu.pe/admisionsabado{}{}.html".format(part, i)
            u = urllib2.urlopen(url)
            try:
                html = u.read()
            finally:
                u.close()
            soup = BeautifulSoup(html)
            for tr in soup.find_all('tr')[1:]:
                tds = tr.find_all('td')
                row = [elem.text.encode('utf-8') for elem in tds[:6]]
                writer.writerow(row)
Using this for loop, I run the function to create a CSV for each link:
for n in l:
    record(n)
EDIT: Following Alecxe's suggestion, I changed the code, but it only works correctly for the second link, and I get an HTTP Error 404: Not Found message. I checked the directory, and only two CSVs are created correctly.
Here is the code:
import csv
import urllib2
from bs4 import BeautifulSoup

def record(part):
    soup = BeautifulSoup(urllib2.urlopen("http://www.admision.unmsm.edu.pe/admisionsabado".format(part)))
    c = []
    for n in soup.find_all('center'):
        for b in n.find_all('a')[1:]:
            c.append(b.text)
    t = (len(links)) / 2
    part = part[:-6]
    name = part.replace("/", "")
    with open('{}.csv'.format(name), 'wb') as f:
        writer = csv.writer(f)
        for i in range(t):
            url = "http://www.admision.unmsm.edu.pe/admisionsabado{}{}.html".format(part, i)
            soup = BeautifulSoup(urllib2.urlopen(url))
            for tr in soup.find_all('tr')[1:]:
                tds = tr.find_all('td')
                row = [elem.text.encode('utf-8') for elem in tds[:6]]
                writer.writerow(row)

soup = BeautifulSoup(urllib2.urlopen("http://www.admision.unmsm.edu.pe/admisionsabado/A.html"))
links = [tr.a["href"].replace(".", "", 1) for tr in soup.find_all('tr')]

for link in links:
    record(link)
Answer (score: 1):
soup.find_all('center') finds nothing. Replace:
c = []
for n in soup.find_all('center'):
    for b in n.find_all('a')[2:]:
        c.append(b.text)
with:
c = [link.text for link in soup.find('table').find_all('a')[2:]]
Also, you can pass urllib2.urlopen(url) directly to the BeautifulSoup constructor:
soup = BeautifulSoup(urllib2.urlopen(url))
此外,由于行中只有一个链接,因此可以简化获取链接列表的方式。而不是:
w = []
for q in soup.find_all('tr'):
    for link in q.find_all('a'):
        w.append(link["href"])
do this:
links = [tr.a["href"] for tr in soup.find_all('tr')]
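One caveat worth noting (not in the original answer): tr.a is None for any tr that contains no link, so the ["href"] lookup would raise a TypeError on such a row. If that can happen on this page, a defensive variant of the same one-liner is:

links = [tr.a["href"] for tr in soup.find_all('tr') if tr.a is not None]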
Also, pay attention to how you name your variables and to your code formatting. See:
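Putting these suggestions together, record() might look like the sketch below. This is only one possible cleanup, assuming the page layout the question describes; BASE_URL, page_links, and page_count are names introduced here for illustration, and the len(...)/2 halving is kept from the question's own code:

import csv
import urllib2
from bs4 import BeautifulSoup

BASE_URL = "http://www.admision.unmsm.edu.pe/admisionsabado"

def record(part):
    # Parse the listing page for this result set straight from urlopen().
    soup = BeautifulSoup(urllib2.urlopen(BASE_URL + part))

    # The pagination links live inside the page's table.
    page_links = [a.text for a in soup.find('table').find_all('a')[2:]]
    page_count = len(page_links) / 2  # halving kept from the question's code

    part = part[:-6]  # strip the trailing "0.html"
    name = part.replace("/", "")
    with open('{}.csv'.format(name), 'wb') as f:
        writer = csv.writer(f)
        for i in range(page_count):
            url = "{}{}{}.html".format(BASE_URL, part, i)
            page = BeautifulSoup(urllib2.urlopen(url))
            # Skip the header row; keep the first six cells of each row.
            for tr in page.find_all('tr')[1:]:
                tds = tr.find_all('td')
                writer.writerow([td.text.encode('utf-8') for td in tds[:6]])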