Empty CSVs from web scraping - Python

Asked: 2014-03-10 05:39:33

Tags: python csv web-scraping beautifulsoup

I am trying to create one CSV for all the tables shown in each link. Here is the link (the index page used in the code below): http://www.admision.unmsm.edu.pe/admisionsabado/A.html

That page contains 36 links, so 36 CSVs should be generated. When I run my code, the 36 CSVs are created, but they are all empty. My code is below:

import csv
import urllib2
from bs4 import BeautifulSoup


# Download the index page and collect the href of every link in each table row.
first = urllib2.urlopen("http://www.admision.unmsm.edu.pe/admisionsabado/A.html").read()
soup = BeautifulSoup(first)
w = []
for q in soup.find_all('tr'):
    for link in q.find_all('a'):
        w.append(link["href"])

# Strip the leading "." from each relative href so it can be joined to the base URL.
l = []
for t in w:
    l.append(t.replace(".", "", 1))

def record(part):
    # Download the directory page for this link; "{}" appends the relative path.
    url = "http://www.admision.unmsm.edu.pe/admisionsabado{}".format(part)
    u = urllib2.urlopen(url)
    try:
        html = u.read()
    finally:
        u.close()
    soup = BeautifulSoup(html)

    # Collect the per-page links listed inside <center> elements.
    c = []
    for n in soup.find_all('center'):
        for b in n.find_all('a')[2:]:
            c.append(b.text)

    t = len(c) / 2  # number of result pages (floor division in Python 2)
    part = part[:-6]  # drop the trailing "0.html"
    name = part.replace("/", "")

    with open('{}.csv'.format(name), 'wb') as f:
        writer = csv.writer(f)
        for i in range(t):
            url = "http://www.admision.unmsm.edu.pe/admisionsabado{}{}.html".format(part, i)
            u = urllib2.urlopen(url)
            try:
                html = u.read()
            finally:
                u.close()
            soup = BeautifulSoup(html)
            for tr in soup.find_all('tr')[1:]:  # skip the header row
                tds = tr.find_all('td')
                row = [elem.text.encode('utf-8') for elem in tds[:6]]
                writer.writerow(row)

Using this for loop, I call the function above to create a CSV for each link:

for n in l:
    record(n)

EDIT: Following alecxe's suggestion, I changed the code, and it now works correctly only for the second link. I also get the message HTTP Error 404: Not Found. I checked the directory, and only two CSVs were created correctly.

Here is the code:

import csv
import urllib2
from bs4 import BeautifulSoup



def record(part):
    soup = BeautifulSoup(urllib2.urlopen("http://www.admision.unmsm.edu.pe/admisionsabado{}".format(part)))
    c = []
    for n in soup.find_all('center'):
        for b in n.find_all('a')[1:]:
            c.append(b.text)

    t = len(links) / 2
    part = part[:-6]
    name = part.replace("/", "")

    with open('{}.csv'.format(name), 'wb') as f:
        writer = csv.writer(f)
        for i in range(t):
            url = "http://www.admision.unmsm.edu.pe/admisionsabado{}{}.html".format(part, i)
            soup = BeautifulSoup(urllib2.urlopen(url))
            for tr in soup.find_all('tr')[1:]:
                tds = tr.find_all('td')
                row = [elem.text.encode('utf-8') for elem in tds[:6]]
                writer.writerow(row)


soup = BeautifulSoup(urllib2.urlopen("http://www.admision.unmsm.edu.pe/admisionsabado/A.html"))
links = [tr.a["href"].replace(".", "", 1) for tr in soup.find_all('tr')]

for link in links:
    record(link)
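
As a stopgap, here is a minimal sketch of how the 404 could be skipped so the remaining links are still processed; the try/except below is an assumption of mine, not part of the code above:

for link in links:
    try:
        record(link)
    except urllib2.HTTPError as e:
        # e.g. "HTTP Error 404: Not Found" when a numbered page does not exist
        print "Skipping {}: {}".format(link, e)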

1 Answer:

Answer 0 (score: 1)

soup.find_all('center') finds nothing.

Replace:

c = []
for n in soup.find_all('center'):
    for b in n.find_all('a')[2:]:
        c.append(b.text)

with:

c = [link.text for link in soup.find('table').find_all('a')[2:]]
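
(The [2:] slice keeps the behavior of your original code: it skips the first two anchors, which are presumably navigation links rather than result pages.)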

Also, you can pass urllib2.urlopen(url) directly to the BeautifulSoup constructor:

soup = BeautifulSoup(urllib2.urlopen(url))

Also, since each row contains only one link, you can simplify how you build the list of links. Instead of:

w = []
for q in soup.find_all('tr'):
    for link in q.find_all('a'):
        w.append(link["href"])

do this:

links = [tr.a["href"] for tr in soup.find_all('tr')]
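
One caveat (my assumption, not something the page is known to guarantee): this expects every tr to contain an <a> tag; for a row without one, tr.a is None and the expression raises a TypeError. A guarded variant:

links = [tr.a["href"] for tr in soup.find_all('tr') if tr.a is not None]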

Also, pay attention to how you name your variables and to your code formatting.
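
For instance, a purely illustrative sketch of the link-collection step with descriptive names (same behavior as above; the names are only suggestions):

import urllib2
from bs4 import BeautifulSoup

BASE_URL = "http://www.admision.unmsm.edu.pe/admisionsabado"

index_soup = BeautifulSoup(urllib2.urlopen(BASE_URL + "/A.html"))
relative_paths = [row.a["href"].replace(".", "", 1)
                  for row in index_soup.find_all('tr')]

for path in relative_paths:
    record(path)  # record() as defined in the question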