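Edit: this cannot be solved as posed; a better approach needs to be considered.
I am scraping this webpage (http://www.oddsportal.com/american-football/usa/nfl-2017-2018/results/#/page/6/) and trying to insert the game dates (the grey sections on the page) into each corresponding game-time row.
I am struggling to implement this logic.
The list of dates scraped from this page is as follows...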
file_days=[['17 Sep 2017'],['15 Sep 2017'],['12 Sep 2017'], ['11 Sep 2017'],['10 Sep 2017'], ['08 Sep 2017'],['01 Sep 2017'],['31 Aug 2017'],
['28 Aug 2017'],['27 Aug 2017'],['26 Aug 2017'],['25 Aug 2017'],['24 Aug 2017']]
file_days=file_days[::-1]
I am trying to insert these dates into the following DataFrame, which contains the start time of each scraped game.
import pandas as pd
data = {'game_time': ['23:00','23:30','23:00','00:00','23:00','23:00','23:00','23:30','23:30','00:00','00:00','00:00','01:00','17:00','20:30','00:00','23:00','23:00','23:00','23:00', '23:00','23:30','23:30','23:30','00:00','00:00','00:00','00:00','00:30','01:00','02:00','02:00','00:30','17:00','17:00','17:00','17:00','17:00','17:00','17:00','17:00','20:05','20:25','20:25','00:30','23:10','02:20','00:25','17:00','17:00']}
df = pd.DataFrame.from_dict(data)
So far I have the following code, but I can't figure out the logic for switching to the next date once the times roll over into a new day.
df.game_time = pd.to_datetime(df.game_time)
df['game'] = df.game_time.dt.strftime('%H:%M')
df['previous_game'] = df.game_time.dt.strftime('%H:%M').shift(1)
df['previous_game'] = df['previous_game'].fillna(str('00:00'))
matchup_day = []
for a,b in zip(df['game'],df['previous_game']):
    if a >= b:
        # current game is not earlier than the previous game: keep the current date
        matchup_day.append(file_days[0])
    else:
        # current game is earlier than the previous game: use the next date
        # and drop the date that was in use until now
        matchup_day.append(file_days[1])
        file_days.pop(0)
The output is as follows...
matchup_day
[['24 Aug 2017'],
['24 Aug 2017'],
['25 Aug 2017'],
['26 Aug 2017'],
['26 Aug 2017'],
['26 Aug 2017'],
['26 Aug 2017'],
['26 Aug 2017'],
['26 Aug 2017'],
['27 Aug 2017'],
['27 Aug 2017'],
['27 Aug 2017'],
['27 Aug 2017'],
['27 Aug 2017'],
['27 Aug 2017'],
['28 Aug 2017'],
['28 Aug 2017'],
['28 Aug 2017'],
['28 Aug 2017'],
['28 Aug 2017'],
['28 Aug 2017'],
['28 Aug 2017'],
['28 Aug 2017'],
['28 Aug 2017'],
['31 Aug 2017'],
['31 Aug 2017'],
['31 Aug 2017'],
['31 Aug 2017'],
['31 Aug 2017'],
['31 Aug 2017'],
['31 Aug 2017'],
['31 Aug 2017'],
['01 Sep 2017'],
['01 Sep 2017'],
['01 Sep 2017'],
['01 Sep 2017'],
['01 Sep 2017'],
['01 Sep 2017'],
['01 Sep 2017'],
['01 Sep 2017'],
['01 Sep 2017'],
['01 Sep 2017'],
['01 Sep 2017'],
['01 Sep 2017'],
['08 Sep 2017'],
['08 Sep 2017'],
['10 Sep 2017'],
['11 Sep 2017'],
['11 Sep 2017'],
['11 Sep 2017']]
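This output is clearly wrong: it trips up at row 15 of the DataFrame, which corresponds to 28 Aug on the website. Does anyone have ideas on how to improve this logic?
I am also open to completely different ways of achieving this. Thanks in advance; I am thoroughly stumped.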
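One case that a time-only comparison cannot catch, sketched here with made-up times rather than the scraped data: if the first game of a new day starts later than the last game of the previous day, the time never drops, so no date change is detected. The names `times`, `true_day` and `inferred_day` are hypothetical and only serve the illustration.

```python
# Hypothetical times, not taken from the scraped page.
# Assume the first two games are on day 0 and the third is on day 1.
times = ['13:00', '17:00', '20:00']
true_day = [0, 0, 1]  # assumed ground truth for this illustration

inferred_day = []
day = 0
previous = times[0]
for current in times:
    # Treat a drop in the time of day as the start of a new day.
    if current < previous:
        day += 1
    inferred_day.append(day)
    previous = current

print(inferred_day)              # every game lands on the same day
print(inferred_day == true_day)  # the heuristic misses the day change
```

In this sketch the third game is assigned to the first day, because nothing in the times alone signals that the date changed.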
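Answer 0 (score: 1)
There is no need for a manual loop here. You can compare the series with a shifted version of itself, then use pd.Series.cumsum
and map the result through a dictionary.
Here is a demo: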
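from itertools import chain
file_days = [['17 Sep 2017'], ['15 Sep 2017'], ['12 Sep 2017'], ['11 Sep 2017'],
             ['10 Sep 2017'], ['08 Sep 2017'], ['01 Sep 2017'], ['31 Aug 2017'],
             ['28 Aug 2017'], ['27 Aug 2017'], ['26 Aug 2017'], ['25 Aug 2017'],
             ['24 Aug 2017']]
# flatten the nested list, oldest date first, and key each date by its position
d = dict(enumerate(chain.from_iterable(file_days[::-1])))
# a time earlier than the previous row's marks a new day; the running count
# of such rows is the position of the date to look up
df['date'] = (df['game'] < df['game'].shift()).cumsum().map(d)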
Result:
print(df[['game', 'date']])
game date
0 23:00 24 Aug 2017
1 23:30 24 Aug 2017
2 23:00 25 Aug 2017
3 00:00 26 Aug 2017
4 23:00 26 Aug 2017
5 23:00 26 Aug 2017
6 23:00 26 Aug 2017
7 23:30 26 Aug 2017
8 23:30 26 Aug 2017
9 00:00 27 Aug 2017
10 00:00 27 Aug 2017
11 00:00 27 Aug 2017
12 01:00 27 Aug 2017
13 17:00 27 Aug 2017
14 20:30 27 Aug 2017
15 00:00 28 Aug 2017
16 23:00 28 Aug 2017
17 23:00 28 Aug 2017
18 23:00 28 Aug 2017
19 23:00 28 Aug 2017
20 23:00 28 Aug 2017
21 23:30 28 Aug 2017
22 23:30 28 Aug 2017
23 23:30 28 Aug 2017
24 00:00 31 Aug 2017
25 00:00 31 Aug 2017
26 00:00 31 Aug 2017
27 00:00 31 Aug 2017
28 00:30 31 Aug 2017
29 01:00 31 Aug 2017
30 02:00 31 Aug 2017
31 02:00 31 Aug 2017
32 00:30 01 Sep 2017
33 17:00 01 Sep 2017
34 17:00 01 Sep 2017
35 17:00 01 Sep 2017
36 17:00 01 Sep 2017
37 17:00 01 Sep 2017
38 17:00 01 Sep 2017
39 17:00 01 Sep 2017
40 17:00 01 Sep 2017
41 20:05 01 Sep 2017
42 20:25 01 Sep 2017
43 20:25 01 Sep 2017
44 00:30 08 Sep 2017
45 23:10 08 Sep 2017
46 02:20 10 Sep 2017
47 00:25 11 Sep 2017
48 17:00 11 Sep 2017
49 17:00 11 Sep 2017
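To see how the one-liner above arrives at these dates, the intermediate steps can be broken out on a much smaller sample. This is only a sketch: the three times and two dates are made up for illustration, and the variable names `rolled_over`, `day_index` and `lookup` are not part of the original answer.

```python
import pandas as pd

# Hypothetical sample: three game times and two dates, oldest first.
times = pd.Series(['23:00', '23:30', '00:00'])
dates = ['24 Aug 2017', '25 Aug 2017']

# True wherever the time is earlier than the previous row's time,
# which is taken as the signal that a new day has started.
# The first row has no predecessor and compares as False.
rolled_over = times < times.shift()

# The running count of rollovers gives each row a 0-based day index.
day_index = rolled_over.cumsum()

# Look up each day index in a {position: date} dictionary.
lookup = dict(enumerate(dates))
result = day_index.map(lookup)

print(rolled_over.tolist())  # [False, False, True]
print(day_index.tolist())    # [0, 0, 1]
print(result.tolist())       # ['24 Aug 2017', '24 Aug 2017', '25 Aug 2017']
```

If the count of rollovers ever exceeds the number of available dates, `map` returns NaN for those rows, which can serve as a quick check that the times and the date list are out of step.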