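I am scraping data from the WSJ Biggest Gainers page. I am new to Python, so I am sure this is simple; I just can't find a clear answer.
My code currently downloads data from only one page, but I would like it to return data for previous days as well, e.g. using find_all or select to pull the data from the chart. How can I modify the URL in my code to do this? I am using Python 3.4.3 and bs4.
The good news is that the URLs for previous days differ only slightly.
For example, this is last Friday:
http://online.wsj.com/mdc/public/page/2_3021-gainnnm-gainer-20150731.html?mod=mdc_pastcalendar
And this is last Thursday:
http://online.wsj.com/mdc/public/page/2_3021-gainnnm-gainer-20150730.html?mod=mdc_pastcalendar
Ideally, I would like to be able to change the month, day, or year as needed, and then loop over the resulting page URLs to retrieve the data I want.
Here is my code: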
import requests
from bs4 import BeautifulSoup

url = 'http://online.wsj.com/mdc/public/page/2_3021-gainnyse-gainer.html'
r = requests.get(url)  # download the page HTML
soup = BeautifulSoup(r.content, 'html.parser')  # parse the HTML
v_data = soup.select('.text')  # every element with class "text"
for symbol in v_data:
    print(symbol.text)
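I just want to loop this over the past X days. I tried building a list of URLs, with no luck. Building a URL list by hand is also tedious, so it would be better if I could plug in the month, year, and day with something like %s or %d.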
Answer 0 (score: 3)
You can begin with a start date and step it back one day at a time with -= and a timedelta, passing each date into the URL with str.format and strftime:
import requests
from bs4 import BeautifulSoup
from datetime import date, timedelta

# {} is filled with the date formatted as YYYYMMDD
start_url = "http://online.wsj.com/mdc/public/page/2_3021-gainnnm-gainer-{}.html?mod=mdc_pastcalendar"
start = date.today()
for _ in range(5):
    url = start_url.format(start.strftime("%Y%m%d"))
    start -= timedelta(days=1)  # step back one calendar day
    r = requests.get(url)  # download the page HTML
    soup = BeautifulSoup(r.content, "html.parser")  # parse the HTML
    v_data = soup.select('.text')
    for symbol in v_data:
        print(symbol.text)
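As a way to check the date formatting without hitting the site, the URL construction can be pulled out into a small function. This is only a sketch: build_urls and URL_TEMPLATE are names made up for illustration, and the URL pattern is the one quoted in the question.

```python
from datetime import date, timedelta

# URL pattern taken from the question; {} receives the date as YYYYMMDD.
URL_TEMPLATE = (
    "http://online.wsj.com/mdc/public/page/"
    "2_3021-gainnnm-gainer-{}.html?mod=mdc_pastcalendar"
)


def build_urls(start, days):
    """Return URLs for `start` and the `days - 1` calendar days before it.

    Hypothetical helper: it only formats strings and makes no requests.
    """
    return [
        URL_TEMPLATE.format((start - timedelta(days=offset)).strftime("%Y%m%d"))
        for offset in range(days)
    ]


for url in build_urls(date(2015, 7, 31), 3):
    print(url)
```

Keeping the formatting separate from the download makes it easy to inspect the generated URLs before sending any requests.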
Just create whatever date you want. If you want a specific start date, create a datetime object:
import requests
from bs4 import BeautifulSoup
from datetime import datetime, timedelta

start_url = "http://online.wsj.com/mdc/public/page/2_3021-gainnnm-gainer-{}.html?mod=mdc_pastcalendar"
# Note: a leading zero such as 07 is a syntax error in Python 3, so write 7
start = datetime(2015, 7, 31)
for _ in range(5):
    print("Data for {}".format(start.strftime("%b %d %Y")))
    url = start_url.format(start.strftime("%Y%m%d"))
    start -= timedelta(days=1)  # step back one calendar day
    r = requests.get(url)  # download the page HTML
    soup = BeautifulSoup(r.content, "html.parser")  # parse the HTML
    v_data = soup.select('.text')
    for symbol in v_data:
        print(symbol.text.rstrip())
    print(" ")
Output:
Data for Jul 31 2015
|
WHAT'S THIS?
|
1
MoneyGram International (MGI)
2
YRC Worldwide (YRCW)
3
Immersion (IMMR)
4
Skywest (SKYW)
5
Vital Therapies (VTL)
6
..........................
Data for Jul 30 2015
|
WHAT'S THIS?
|
1
H&E Equipment Services (HEES)
2
Senomyx (SNMX)
3
eHealth (EHTH)
4
Nutrisystem (NTRI)
5
Open Text (OTEX)
6
LivePerson (LPSN)
7
Sonus Networks (SONS)
8
FormFactor (FORM)
9
Pegasystems (PEGA)
10
Town Sports International Holdings (CLUB)
11
FARO Technologies (FARO)
12
Presbia (LENS)
13
If you only want to include weekdays but still want n days, we need to add a bit more logic:
import requests
from bs4 import BeautifulSoup
from datetime import datetime, timedelta

start_url = "http://online.wsj.com/mdc/public/page/2_3021-gainnnm-gainer-{}.html?mod=mdc_pastcalendar"
start = datetime(2015, 7, 31)


def only_weekdays_range(start, n):
    # Yield n weekdays, walking backwards from start
    i = 0
    wk_days = {0, 1, 2, 3, 4}  # Monday through Friday
    while i != n:
        # Skip back over Saturday and Sunday
        while start.weekday() not in wk_days:
            start -= timedelta(days=1)
        yield start
        i += 1
        start -= timedelta(days=1)


for dte in only_weekdays_range(start, 2):
    # Use the yielded date (dte), not the fixed start date
    print("Data for {}".format(dte.strftime("%b %d %Y")))
    url = start_url.format(dte.strftime("%Y%m%d"))
    print(url)
    r = requests.get(url)  # download the page HTML
    soup = BeautifulSoup(r.content, "html.parser")  # parse the HTML
    v_data = soup.select('.text')
    for symbol in v_data:
        print(symbol.text.rstrip())
    print(" ")
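only_weekdays_range yields n days going back from the start date, excluding weekends. You can see this by running print(list(only_weekdays_range(datetime(2015, 7, 26), 2))). We get [datetime.datetime(2015, 7, 24, 0, 0), datetime.datetime(2015, 7, 23, 0, 0)], i.e. Friday the 24th and Thursday the 23rd, because our start day was Sunday the 26th.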
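To confirm which days of the week come back, the generator can be run offline and each result labelled with its weekday name. The sketch below restates only_weekdays_range in a slightly more compact form so that it is self-contained; the behaviour is intended to match the version in the answer.

```python
from datetime import datetime, timedelta


def only_weekdays_range(start, n):
    """Yield n weekdays, walking backwards from start and skipping weekends."""
    yielded = 0
    while yielded < n:
        # weekday() is 0 for Monday through 6 for Sunday
        while start.weekday() > 4:
            start -= timedelta(days=1)
        yield start
        yielded += 1
        start -= timedelta(days=1)


days = list(only_weekdays_range(datetime(2015, 7, 26), 2))
# %A gives the full weekday name in the current locale
labels = [d.strftime("%A %Y-%m-%d") for d in days]
print(labels)
```

Starting from a Sunday, the inner loop first steps back to the preceding Friday, so the two yielded dates have weekday() values 4 and 3.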
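If you want to exclude holidays, there is more work to do. Another approach is to decrement n only when v_data actually returns data, but that could lead to an infinite loop for various reasons.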
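One way to try the "count a day only when data comes back" idea without risking an infinite loop is to cap the number of days examined. The following is a minimal sketch under stated assumptions: collect_days_with_data, fetch, and max_attempts are hypothetical names, and fake_fetch stands in for the real requests and BeautifulSoup calls so that the example runs offline.

```python
from datetime import date, timedelta


def collect_days_with_data(start, n, fetch, max_attempts=30):
    """Walk backwards from start until n days returned data or attempts run out.

    fetch is a hypothetical callable: it takes a date and returns a list of
    rows (empty when the page has no data, e.g. on a holiday).
    """
    results = []
    current = start
    for _ in range(max_attempts):  # hard cap, so the loop always terminates
        if len(results) == n:
            break
        if current.weekday() < 5:  # skip Saturday (5) and Sunday (6)
            rows = fetch(current)
            if rows:  # only days that actually returned data count
                results.append((current, rows))
        current -= timedelta(days=1)
    return results


# Stand-in for the real request: pretends 2015-07-30 had no data.
def fake_fetch(day):
    return [] if day == date(2015, 7, 30) else ["row for {}".format(day)]


found = collect_days_with_data(date(2015, 8, 2), 3, fake_fetch)
print([day.isoformat() for day, _ in found])
```

With the cap in place, a site outage or a changed page layout produces a short result list instead of a loop that never ends. In real use, fetch would wrap the download and the soup.select('.text') call and return the selected elements.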