python beautifulsoup scraping archive pages

Date: 2017-02-03 20:00:28

Tags: python web-scraping beautifulsoup

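I know there is a date/time problem somewhere, but I can't tell where. When I try to scrape the older archived tables, the data I get back is today's data, repeated on every iteration. I think I need another enclosing loop to reach the older pages.

How can I fix this?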

from urlparse import urljoin
from urllib2 import urlopen
import requests
from bs4 import BeautifulSoup
import re
from datetime import datetime, timedelta

url = "http://www.wsj.com/mdc/public/page/2_3022-mfsctrscan-moneyflow-{}.html?mod=mdc_pastcalendar"
start = datetime.today()

def only_weekdays_range(start, n):
    i = 0
    wk_days = {0, 1, 2, 3, 4}
    while i != n:
        while start.weekday() not in wk_days:
            start -= timedelta(days=1)
        yield start
    i += 1
    start -= timedelta(days=1)


for _ in (only_weekdays_range(start, 5)):
    print ("data for {}".format(start.strftime("%b %d %y")))
    url = url.format(start.strftime('%Y%m%d'))
    print 'Retrieving information from: ' + url
    print '\n'
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "lxml")
    div_main = soup.find('div', {'id': 'column0'})
    table_one = div_main.find('table')
    def target_row(tag):
       is_row = len(tag.find_all('td')) > 5
       row_name = tag.name == 'tr'
       return is_row and row_name

    rows = table_one.find_all(target_row)[1:]
#print rows
    for row in rows:
        cells = row.findAll('td')
        industry = cells[0].get_text()
        data = {
           'name' : cells[0].get_text()
        }
        print data
        print '\n'

1 answer:

Answer 0 (score: 1)

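You have two variables named start:

  • the global one: start = datetime.today()
  • the local one, the parameter in def only_weekdays_range(start, n):

Inside the function you change the local start:

 start -= timedelta(days=1)

You then return it with yield, and the for loop assigns it to _ in for _ in ..., but you never use _. You keep using the global start, which never changes.

You have to use the value the loop assigns (renamed here from _ to new_date):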

for new_date in (only_weekdays_range(start, 5)):
    print ("data for {}".format(new_date.strftime("%b %d %y")))
    url = url.format(new_date.strftime('%Y%m%d'))
    print 'Retrieving information from: ' + url
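
To make the global/local distinction concrete, here is a minimal sketch. The helper previous_days is hypothetical (it is not from the question) and the date is fixed only so the run is repeatable. It shows that rebinding start inside a generator leaves the caller's start untouched.

```python
from datetime import datetime, timedelta

def previous_days(start, n):
    """Hypothetical helper: yield the n days before start, newest first."""
    for _ in range(n):
        start -= timedelta(days=1)  # rebinds the local name only
        yield start

start = datetime(2017, 2, 3)  # stands in for datetime.today()
yielded = list(previous_days(start, 2))

print(start.strftime("%Y%m%d"))                  # 20170203, unchanged
print([d.strftime("%Y%m%d") for d in yielded])   # ['20170202', '20170201']
```

datetime objects are immutable, so -= creates a new object and binds it to the local name. Only the yielded values carry the change out of the function.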

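But the indentation in your function is also wrong. i += 1 and the last start -= timedelta(days=1) sit outside the while i != n loop, so i never changes, the loop never ends, and the same date is yielded every time. Both lines belong inside the loop: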

def only_weekdays_range(start, n):
    i = 0
    wk_days = {0, 1, 2, 3, 4}
    while i != n:
        while start.weekday() not in wk_days:
            start -= timedelta(days=1)
        yield start
        i += 1
        start -= timedelta(days=1)
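
For comparison, this sketch reproduces the question's indentation under a hypothetical name, broken_weekdays_range, and uses itertools.islice to cut the otherwise endless generator off after 8 items. The fixed Friday is an assumption chosen to make the run repeatable.

```python
from datetime import datetime, timedelta
from itertools import islice

def broken_weekdays_range(start, n):
    i = 0
    wk_days = {0, 1, 2, 3, 4}
    while i != n:
        while start.weekday() not in wk_days:
            start -= timedelta(days=1)
        yield start
    # Never reached for n > 0: these two lines are outside the loop.
    i += 1
    start -= timedelta(days=1)

friday = datetime(2017, 2, 3)
sample = list(islice(broken_weekdays_range(friday, 5), 8))

print(len(sample))               # 8, although only 5 dates were requested
print(set(sample) == {friday})   # True: the same date every time
```

This matches the symptom in the question: every iteration sees the same day.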

Working example:

from datetime import datetime, timedelta

# --- functions ---

def only_weekdays_range(start, n):
    one_day = timedelta(days=1)
    for _ in range(n):
        while start.weekday() > 4:
            start -= one_day
        yield start
        start -= one_day

# --- main ---

start = datetime.today()

for new_date in only_weekdays_range(start, 10):
    print ("data for {}".format(new_date.strftime("%b %d %y %a")))

Result:

data for Feb 03 17 Fri
data for Feb 02 17 Thu
data for Feb 01 17 Wed
data for Jan 31 17 Tue
data for Jan 30 17 Mon
data for Jan 27 17 Fri
data for Jan 26 17 Thu
data for Jan 25 17 Wed
data for Jan 24 17 Tue
data for Jan 23 17 Mon

EDIT: the same function with if instead of while:

def only_weekdays_range(start, n):
    one_day = timedelta(days=1)
    for _ in range(n):
        weekday = start.weekday()
        if weekday > 4:
            start -= one_day * (weekday-4)
        yield start
        start -= one_day
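
A quick way to convince yourself that the two variants agree is to run them side by side from every day of one week. The names weekdays_while and weekdays_if are hypothetical labels for the two functions above. This is only a sanity check, not a proof.

```python
from datetime import datetime, timedelta

def weekdays_while(start, n):
    one_day = timedelta(days=1)
    for _ in range(n):
        while start.weekday() > 4:      # step back one day at a time
            start -= one_day
        yield start
        start -= one_day

def weekdays_if(start, n):
    one_day = timedelta(days=1)
    for _ in range(n):
        weekday = start.weekday()
        if weekday > 4:                 # Sat -> back 1 day, Sun -> back 2 days
            start -= one_day * (weekday - 4)
        yield start
        start -= one_day

monday = datetime(2017, 1, 30)
results_match = all(
    list(weekdays_while(monday + timedelta(days=k), 10))
    == list(weekdays_if(monday + timedelta(days=k), 10))
    for k in range(7)                   # start on each day, Monday to Sunday
)
print(results_match)
```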

EDIT: I see another problem:

 url = url.format(...) 

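You overwrite url, so after the first iteration it no longer contains the {} placeholder and format() cannot change it in the next loop.

Keep the template untouched and use a separate variable: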

full_url = url.format(...)

r = requests.get(full_url)
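
Putting the fixes together, a sketch might look like the following. It only builds the URLs and makes no network request. URL_TEMPLATE and archive_urls are names introduced here for illustration, and the fixed start date is an assumption so the output is repeatable.

```python
from datetime import datetime, timedelta

# Template from the question; it is never overwritten, so {} survives.
URL_TEMPLATE = ("http://www.wsj.com/mdc/public/page/"
                "2_3022-mfsctrscan-moneyflow-{}.html?mod=mdc_pastcalendar")

def only_weekdays_range(start, n):
    one_day = timedelta(days=1)
    for _ in range(n):
        while start.weekday() > 4:
            start -= one_day
        yield start
        start -= one_day

def archive_urls(start, n):
    """Hypothetical helper: yield (date, url) for the n most recent weekdays."""
    for new_date in only_weekdays_range(start, n):
        yield new_date, URL_TEMPLATE.format(new_date.strftime("%Y%m%d"))

pages = list(archive_urls(datetime(2017, 2, 6), 3))  # a Monday
for new_date, full_url in pages:
    print("data for {}".format(new_date.strftime("%b %d %y")))
    print("Retrieving information from: " + full_url)
    # r = requests.get(full_url) and the parsing code would go here
```

Starting on Monday, Feb 6 2017, the three URLs are for 20170206, 20170203 and 20170202, with the weekend skipped.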