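I know I have a datetime problem, but I can't tell where it is. When I try to scrape the older, dated tables, the data I get back is today's data repeated on every iteration. I think I need another wrapping loop to reach the older pages.
How can I fix this?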
from urlparse import urljoin
from urllib2 import urlopen
import requests
from bs4 import BeautifulSoup
import re
from datetime import datetime, timedelta
url = "http://www.wsj.com/mdc/public/page/2_3022-mfsctrscan-moneyflow-{}.html?mod=mdc_pastcalendar"
start = datetime.today()
def only_weekdays_range(start, n):
    i = 0
    wk_days = {0, 1, 2, 3, 4}
    while i != n:
        while start.weekday() not in wk_days:
            start -= timedelta(days=1)
        yield start
        i += 1
        start -= timedelta(days=1)
for _ in (only_weekdays_range(start, 5)):
    print ("data for {}".format(start.strftime("%b %d %y")))
    url = url.format(start.strftime('%Y%m%d'))
    print 'Retrieving information from: ' + url
    print '\n'
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "lxml")
    div_main = soup.find('div', {'id': 'column0'})
    table_one = div_main.find('table')
    def target_row(tag):
        is_row = len(tag.find_all('td')) > 5
        row_name = tag.name == 'tr'
        return is_row and row_name
    rows = table_one.find_all(target_row)[1:]
    #print rows
    for row in rows:
        cells = row.findAll('td')
        industry = cells[0].get_text()
        data = {
            'name' : cells[0].get_text()
        }
        print data
        print '\n'
Answer 0 (score: 1)
You have two variables named start:
start = datetime.today()
def only_weekdays_range(start, n):
Inside the function you change the local start:
start -= timedelta(days=1)
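You then return it with yield, and the loop for _ in ... assigns it to _, but you never use that value. You keep using the global start, which never changes.
You have to use the value from the loop variable (renamed from _ to new_date here):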
for new_date in (only_weekdays_range(start, 5)):
    print ("data for {}".format(new_date.strftime("%b %d %y")))
    url = url.format(new_date.strftime('%Y%m%d'))
    print 'Retrieving information from: ' + url
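A minimal sketch of why the global start never moves: inside a generator, start -= ... only rebinds the generator's local name, because datetime objects are immutable. The helper name countdown and the fixed date are made up for illustration and are not part of the original code.

```python
from datetime import datetime, timedelta

def countdown(start, n):
    # `start` is a local name here; rebinding it does not affect the caller.
    for _ in range(n):
        yield start
        start -= timedelta(days=1)

start = datetime(2017, 2, 3)           # fixed date, chosen for illustration
yielded = list(countdown(start, 3))

print(start)      # still the original date
print(yielded)    # three different dates, one per iteration
```

Only the yielded values change from one iteration to the next, so the loop body has to read the loop variable rather than start.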
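Also check the indentation in the function: yield start, i += 1 and the final decrement belong to the outer while loop, not the inner one: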
def only_weekdays_range(start, n):
    i = 0
    wk_days = {0, 1, 2, 3, 4}
    while i != n:
        while start.weekday() not in wk_days:
            start -= timedelta(days=1)
        yield start
        i += 1
        start -= timedelta(days=1)
Working example:
from datetime import datetime, timedelta

# --- functions ---

def only_weekdays_range(start, n):
    one_day = timedelta(days=1)
    for _ in range(n):
        # step back over Saturday (5) and Sunday (6)
        while start.weekday() > 4:
            start -= one_day
        yield start
        start -= one_day

# --- main ---

start = datetime.today()
for new_date in only_weekdays_range(start, 10):
    print ("data for {}".format(new_date.strftime("%b %d %y %a")))
Result:
data for Feb 03 17 Fri
data for Feb 02 17 Thu
data for Feb 01 17 Wed
data for Jan 31 17 Tue
data for Jan 30 17 Mon
data for Jan 27 17 Fri
data for Jan 26 17 Thu
data for Jan 25 17 Wed
data for Jan 24 17 Tue
data for Jan 23 17 Mon
EDIT: the same with if instead of while
def only_weekdays_range(start, n):
    one_day = timedelta(days=1)
    for _ in range(n):
        weekday = start.weekday()
        if weekday > 4:
            # Saturday (5) goes back 1 day, Sunday (6) goes back 2 days
            start -= one_day * (weekday-4)
        yield start
        start -= one_day
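As a quick sanity check, the two variants can be compared on a fixed start date. This is only a sketch: the names weekdays_while and weekdays_if are hypothetical labels for the two versions above, and the start date (Sunday, Feb 5 2017) is picked so the weekend branch runs right away.

```python
from datetime import datetime, timedelta

def weekdays_while(start, n):
    one_day = timedelta(days=1)
    for _ in range(n):
        while start.weekday() > 4:      # skip Saturday/Sunday one day at a time
            start -= one_day
        yield start
        start -= one_day

def weekdays_if(start, n):
    one_day = timedelta(days=1)
    for _ in range(n):
        weekday = start.weekday()
        if weekday > 4:                 # jump straight back to Friday
            start -= one_day * (weekday - 4)
        yield start
        start -= one_day

sunday = datetime(2017, 2, 5)
from_while = list(weekdays_while(sunday, 10))
from_if = list(weekdays_if(sunday, 10))

print(from_while == from_if)
print([d.strftime("%b %d %a") for d in from_if])
```

Both generators should produce the same ten weekdays, starting from the Friday before the given Sunday.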
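EDIT: I see another problem.
In
url = url.format(...)
you overwrite url. After the first iteration the string no longer contains the {} placeholder, so format() cannot change it in later iterations.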
Instead, keep the template intact and store the result under a different name:
full_url = url.format(...)
r = requests.get(full_url)
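The effect is easy to reproduce without any network access. In the sketch below the template URL is a hypothetical stand-in (only the {} placeholder matters), and the names buggy_urls and full_urls are chosen for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical template standing in for the real page URL.
url = "http://example.com/moneyflow-{}.html"

dates = [datetime(2017, 2, 3) - timedelta(days=k) for k in range(3)]

# Buggy pattern: the template is overwritten on the first pass,
# so later format() calls find no placeholder and return the same string.
buggy = url
buggy_urls = []
for d in dates:
    buggy = buggy.format(d.strftime("%Y%m%d"))
    buggy_urls.append(buggy)

# Fixed pattern: the template stays untouched; each result gets its own name.
full_urls = []
for d in dates:
    full_url = url.format(d.strftime("%Y%m%d"))
    full_urls.append(full_url)

print(buggy_urls)   # the same URL three times
print(full_urls)    # one URL per date
```

Calling format() on a string without placeholders raises no error, which is why the bug goes unnoticed: every request silently hits the first date's page.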