Logic flow - trying to loop through website pages with BeautifulSoup and CSV Writer

Date: 2015-05-20 20:42:58

Tags: python csv beautifulsoup

I can't seem to work out the proper indentation/clause placement to get this to loop over more than one page. This code currently prints out a nice CSV file, but only for the first page.

#THIS WORKS BUT ONLY PRINTS THE FIRST PAGE

from bs4 import BeautifulSoup
from urllib2 import urlopen
import csv

page_num = 1
total_pages = 20

with open("MegaMillions.tsv","w") as f:
    fieldnames = ['date', 'numbers', 'moneyball']
    writer = csv.writer(f, delimiter = '\t')
    writer.writerow(fieldnames)

    while page_num < total_pages:
        page_num = str(page_num)
        soup = BeautifulSoup(urlopen('http://www.usamega.com/mega-millions-history.asp?p='+page_num).read())

    for row in soup('table',{'bgcolor':'white'})[0].findAll('tr'):

        tds = row('td')
        if tds[1].a is not None:
            date = tds[1].a.string.encode("utf-8")
            if tds[3].b is not None:
                uglynumber = tds[3].b.string.split()
                betternumber = [int(uglynumber[i]) for i in range(len(uglynumber)) if i%2==0]
                moneyball = tds[3].strong.string.encode("utf-8")

                writer.writerow([date, betternumber, moneyball])
        page_num = int(page_num)
        page_num += 1

print 'We\'re done here.'
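A likely reason this version stops after one page: `page_num = str(page_num)` turns the counter into a string, and in Python 2 a string never compares less than an integer (mixed types order by type name), so `'1' < 20` is False and the `while` loop exits after the first fetch. The dedented `for` loop then runs once, over whichever `soup` was fetched last. The sketch below (Python 3, with the network and BeautifulSoup mocked out; `fake_pages` is a stand-in for the parsed pages, not the real site) shows just the control-flow half of the problem, using an integer counter throughout:

```python
# fake_pages stands in for the parsed pages; the real code calls
# urlopen(...) and BeautifulSoup(...) here instead.
fake_pages = {1: ["row-1a", "row-1b"], 2: ["row-2a"], 3: ["row-3a"]}

written = []
page_num = 1
total_pages = 3

while page_num < total_pages:      # note: '<' skips the final page
    soup = fake_pages[page_num]    # stand-in for the fetch + parse
    page_num += 1

for row in soup:                   # runs only AFTER the while loop ends,
    written.append(row)            # so it sees just the last fetched page

print(written)                     # ['row-2a'] -- one page, not all of them
```

Even with the counter fixed, a dedented `for` loop can only ever process one page; it has to be nested inside the fetch loop. (The `<` vs `<=` off-by-one also silently drops the last page.)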

And of course, this version only prints the last page:

#THIS WORKS BUT ONLY PRINTS THE LAST PAGE

from bs4 import BeautifulSoup
from urllib2 import urlopen
import csv

page_num = 1
total_pages = 20

while page_num < total_pages:
    page_num = str(page_num)
    soup = BeautifulSoup(urlopen('http://www.usamega.com/mega-millions-history.asp?p='+page_num).read())

    with open("MegaMillions.tsv","w") as f:
        fieldnames = ['date', 'numbers', 'moneyball']
        writer = csv.writer(f, delimiter = '\t')
        writer.writerow(fieldnames)

        for row in soup('table',{'bgcolor':'white'})[0].findAll('tr'):

            tds = row('td')
            if tds[1].a is not None:
                date = tds[1].a.string.encode("utf-8")
                if tds[3].b is not None:
                    uglynumber = tds[3].b.string.split()
                    betternumber = [int(uglynumber[i]) for i in range(len(uglynumber)) if i%2==0]
                    moneyball = tds[3].strong.string.encode("utf-8")

                    writer.writerow([date, betternumber, moneyball])
        page_num = int(page_num)
        page_num += 1

print 'We\'re done here.'

2 Answers:

Answer 0 (score: 2)

The problem with your second code sample is that you overwrite your file on every pass through the loop. Instead of

open("MegaMillions.tsv","w")

use

open("MegaMillions.tsv","a")

"a" opens the file for appending, which is what you want to do here.
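The difference is easy to demonstrate without any scraping. The sketch below (Python 3, writing throwaway files in a temp directory; the file names and "page" strings are invented for the demo) reopens a file once per "page", the way the second code sample does, under each mode:

```python
import csv
import os
import tempfile

tmpdir = tempfile.mkdtemp()

def write_pages(mode):
    """Reopen the file once per 'page', as the looped open(...) does."""
    path = os.path.join(tmpdir, "demo-" + mode + ".tsv")
    for page in ["page1", "page2", "page3"]:
        with open(path, mode) as f:
            csv.writer(f, delimiter="\t").writerow([page])
    with open(path) as f:
        return f.read().split()

w_rows = write_pages("w")   # "w" truncates on every reopen
a_rows = write_pages("a")   # "a" keeps the earlier rows

print(w_rows)   # ['page3'] -- only the last page survives
print(a_rows)   # ['page1', 'page2', 'page3'] -- rows accumulate
```

With "w", each reopen throws away everything written so far, which is exactly why only the final page's rows end up in the file.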

Answer 1 (score: -1)

Thanks for the suggestion; here is a working variation:

from bs4 import BeautifulSoup
from urllib2 import urlopen
import csv

page_num = 1
total_pages = 73

with open("MegaMillions.tsv","w") as f:
    fieldnames = ['date', 'numbers', 'moneyball']
    writer = csv.writer(f, delimiter = '\t')
    writer.writerow(fieldnames)

    while page_num <= total_pages:
        page_num = str(page_num)
        soup = BeautifulSoup(urlopen('http://www.usamega.com/mega-millions-history.asp?p='+page_num).read())

        for row in soup('table',{'bgcolor':'white'})[0].findAll('tr'):

            tds = row('td')
            if tds[1].a is not None:
                date = tds[1].a.string.encode("utf-8")
                if tds[3].b is not None:
                    uglynumber = tds[3].b.string.split()
                    betternumber = [int(uglynumber[i]) for i in range(len(uglynumber)) if i%2==0]
                    moneyball = tds[3].strong.string.encode("utf-8")

                    writer.writerow([date, betternumber, moneyball])
        page_num = int(page_num)
        page_num += 1

print 'We\'re done here.'

I chose this structure over switching to 'a', because with 'a' the header row would have been written once per page.
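The same shape can be tidied up a little further: open the file once, write the header once, and let `for`/`range` manage an integer page counter so no `str()`/`int()` juggling is needed. The sketch below is Python 3 with the network and HTML parsing mocked out; `fetch_rows` and the dates it returns are placeholders, not real usamega.com data:

```python
import csv

def fetch_rows(page_num):
    # Stand-in for urlopen + BeautifulSoup on
    # 'http://www.usamega.com/mega-millions-history.asp?p=' + str(page_num).
    # Returns one fabricated [date, numbers, moneyball] row per page.
    return [["05/%02d/2015" % page_num, [1, 2, 3, 4, 5], str(page_num)]]

total_pages = 3

with open("MegaMillions.tsv", "w") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerow(["date", "numbers", "moneyball"])  # header written once
    for page_num in range(1, total_pages + 1):         # counter stays an int;
        for row in fetch_rows(page_num):               # format it only when
            writer.writerow(row)                       # building the URL
```

`range(1, total_pages + 1)` also sidesteps the `<` vs `<=` off-by-one from the earlier attempts, since it covers pages 1 through `total_pages` inclusive.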