I have a DataFrame of shape (5819192, 8).
Below is a sample of the MLB game schedule with its column names. 'date', 'ym', 'yr' and 'pair' are the important variables.
gubun date away hom result yr ym pair
0 MLB 20180907 SD CIN SD 6, CIN 12 2018 201809 [SD,CIN]
1 MLB 20180907 PIT MIA PIT 5, MIA 3 2018 201809 [PIT,MIA]
2 MLB 20180907 TOR CLE TOR 3, CLE 2 2018 201809 [TOR,CLE]
3 MLB 20180907 TB BAL TB 14, BAL 2 2018 201809 [TB,BAL]
4 MLB 20180907 HOU BOS HOU 6, BOS 3 2018 201809 [HOU,BOS]
5 MLB 20180907 PHI NYM PHI 4, NYM 3 2018 201809 [PHI,NYM]
6 MLB 20180907 STL DET STL 3, DET 5 2018 201809 [STL,DET]
7 MLB 20180907 MIN KC MIN 10, KC 6 2018 201809 [MIN,KC]
8 MLB 20180907 LAA CWS LAA 5, CWS 2 2018 201809 [LAA,CWS]
9 MLB 20180907 SF MIL SF 2, MIL 4 2018 201809 [SF,MIL]
10 MLB 20180907 LAD COL LAD 4, COL 2 2018 201809 [LAD,COL]
11 MLB 20180907 ATL ARI ATL 3, ARI 5 2018 201809 [ATL,ARI]
12 MLB 20180907 TEX OAK TEX 4, OAK 8 2018 201809 [TEX,OAK]
13 MLB 20180907 SEA NYY SEA 0, NYY 4 2018 201809 [SEA,NYY]
14 MLB 20190502 SD ATL SD 11, ATL 2 2019 201905 [SD,ATL]
15 MLB 20190502 NYM CIN NYM 1, CIN 0 2019 201905 [NYM,CIN]
16 MLB 20190502 MIN HOU MIN 8, HOU 2 2019 201905 [MIN,HOU]
17 MLB 20190502 MIL COL MIL 6, COL 11 2019 201905 [MIL,COL]
18 MLB 20190502 TB KC TB 3, KC 1 2019 201905 [TB,KC]
19 MLB 20190502 WSH STL WSH 2, STL 1 2019 201905 [WSH,STL]
20 MLB 20190502 CWS BOS CWS 6, BOS 4 2019 201905 [CWS,BOS]
21 MLB 20190502 TOR LAA TOR 2, LAA 6 2019 201905 [TOR,LAA]
22 MLB 20190714 WSH PHI WSH 3, PHI 4 2019 201907 [WSH,PHI]
23 MLB 20190714 TOR NYY TOR 2, NYY 4 2019 201907 [TOR,NYY]
24 MLB 20190714 TB BAL TB 4, BAL 1 2019 201907 [TB,BAL]
25 MLB 20190714 NYM MIA NYM 6, MIA 2 2019 201907 [NYM,MIA]
26 MLB 20190714 MIN CLE MIN 3, CLE 4 2019 201907 [MIN,CLE]
27 MLB 20190714 SF MIL SF 8, MIL 3 2019 201907 [SF,MIL]
28 MLB 20190714 STL ARI STL 5, ARI 2 2019 201907 [STL,ARI]
29 MLB 20190714 KC DET KC 8, DET 12 2019 201907 [KC,DET]
30 MLB 20190714 PIT CHC PIT 3, CHC 8 2019 201907 [PIT,CHC]
31 MLB 20190714 TEX HOU TEX 4, HOU 12 2019 201907 [TEX,HOU]
32 MLB 20190714 COL CIN COL 10, CIN 9 2019 201907 [COL,CIN]
33 MLB 20190714 OAK CWS OAK 3, CWS 2 2019 201907 [OAK,CWS]
34 MLB 20190714 SEA LAA SEA 3, LAA 6 2019 201907 [SEA,LAA]
35 MLB 20190714 SD ATL SD 1, ATL 4 2019 201907 [SD,ATL]
36 MLB 20190714 LAD BOS LAD 7, BOS 4 2019 201907 [LAD,BOS]
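I am trying to build many lists from one DataFrame column, one list per time period.
For example, if the 'yr' column is 2019, the row's 'pair' value goes into the list named 2019.
This way I get one list per year from the DataFrame.
That part is not difficult.
Monthly lists also work fine.
Daily lists, however, are hard because of the size of the data.
With about 5 million observations, building about 1,500 daily lists is very heavy in my code, and the processing time is very long.
So please show me the most efficient way to do this.
My code is below.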
import pandas as pd
import networkx as nx
import numpy as np

df = pd.read_csv("C:\\example(MLB).csv", encoding="cp949")

# Sorted unique values of each period column, keyed by column name
sort = {}
prd = ['yr', 'ym', 'date']
for j in range(0,3):
    sort[j] = df.sort_values(by = prd[j]).drop_duplicates([prd[j]])[prd[j]]
    sort[prd[j]] = sort.pop(j)

for i in (0,1):
    empty = {}
    period = 'date' # Choose among (yr, ym, date)
    for i in sort[period]:
        empty[i] = df[df[period].isin([i])]['pair']
    # With period = 'yr' or 'ym' there is no problem.
    # With period = 'date', however, it takes a very long time.
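The inner loop above rescans the whole DataFrame with `isin` once per distinct date, so roughly 1,500 passes over about 5.8 million rows. One possible way around that, as a minimal sketch only, is to let `groupby` split the frame in a single pass. The toy frame and the helper name `pairs_by_period` are made up for illustration, and 'pair' is assumed to already hold 2-element lists.

```python
import pandas as pd

# Toy stand-in for the real data; column names follow the sample above.
df = pd.DataFrame({
    "date": [20180907, 20180907, 20190502, 20190714],
    "ym": [201809, 201809, 201905, 201907],
    "yr": [2018, 2018, 2019, 2019],
    "pair": [["SD", "CIN"], ["PIT", "MIA"], ["SD", "ATL"], ["WSH", "PHI"]],
})

def pairs_by_period(frame, period):
    """Return {period value: list of pairs}, splitting the frame once via groupby."""
    grouped = frame.groupby(period, sort=True)["pair"]
    return {key: group.tolist() for key, group in grouped}

daily = pairs_by_period(df, "date")   # hypothetical replacement for `empty`
yearly = pairs_by_period(df, "yr")
```

The same call covers 'yr', 'ym' and 'date', so the separate `sort` dictionary of unique values would no longer be needed under this approach.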
FYI, the reason I need so many lists is that I want to run a graph package (networkx) on them.
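frames = []  # one hubs/authorities frame per period
G = {}
Tu = {}
for i in empty:
    G[i] = nx.Graph()
    G[i].add_edges_from(empty[i])
    Tu[i] = nx.hits(G[i], normalized = True)
    # DataFrame.append was removed in pandas 2.0, so collect the frames and concatenate once
    frames.append(pd.DataFrame(Tu[i]).T.assign(period = i))
hits = pd.concat(frames)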
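The graph loop could probably also skip the intermediate dictionary and iterate over the groups directly. The following is a rough sketch under assumptions, not a tested solution for the full data: `per_period_scores` and its `score` argument are hypothetical names, 'pair' is assumed to hold 2-element lists, and node degree is used as the score so the example stays small.

```python
import networkx as nx
import pandas as pd

# Toy stand-in for the real data; 'pair' is assumed to hold 2-element lists.
df = pd.DataFrame({
    "date": [20180907, 20180907, 20180907, 20190502],
    "pair": [["SD", "CIN"], ["PIT", "MIA"], ["SD", "MIA"], ["SD", "ATL"]],
})

def per_period_scores(frame, period, score):
    """Build one graph per period value and return all scores as one long DataFrame.

    `score` is any function mapping a graph to a {node: value} dict.
    """
    rows = []
    for key, pairs in frame.groupby(period)["pair"]:
        graph = nx.Graph()
        graph.add_edges_from(pairs)  # each pair becomes one undirected edge
        for node, value in score(graph).items():
            rows.append({period: key, "node": node, "value": value})
    # A single DataFrame construction avoids growing a frame inside the loop.
    return pd.DataFrame(rows)

# Degree is used as the score here; another graph measure could be passed instead.
degree = per_period_scores(df, "date", lambda g: dict(g.degree()))
```

Since `nx.hits` returns a (hubs, authorities) tuple of dictionaries, something like `lambda g: nx.hits(g, normalized=True)[0]` should work as the `score` argument for hub scores, though I have not timed that on data of this size.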
I have to compute node degree for each day, which is why I need the daily lists.
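If daily degree is the only quantity needed, it might be possible to avoid building graphs at all. The sketch below assumes that each 'pair' is simply [away, hom], as in the sample, and that degree means the number of distinct opponents a team has on a given day (which is what an undirected `nx.Graph` would report). The helper name `daily_degree` and the toy data are illustrative only.

```python
import pandas as pd

# Toy stand-in using the 'date', 'away' and 'hom' columns from the sample above.
# SD plays CIN twice on the same day to show how repeated games are handled.
df = pd.DataFrame({
    "date": [20180907, 20180907, 20180907, 20190502],
    "away": ["SD", "PIT", "SD", "SD"],
    "hom": ["CIN", "MIA", "CIN", "ATL"],
})

def daily_degree(frame):
    """Count distinct opponents per (date, team)."""
    # List every game from both teams' point of view.
    forward = frame[["date", "away", "hom"]].rename(
        columns={"away": "team", "hom": "opponent"})
    backward = frame[["date", "hom", "away"]].rename(
        columns={"hom": "team", "away": "opponent"})
    edges = pd.concat([forward, backward], ignore_index=True)
    # nx.Graph keeps one edge per pair, so repeated games on one day count once.
    edges = edges.drop_duplicates()
    return edges.groupby(["date", "team"]).size().rename("degree")

degree = daily_degree(df)
```

The result is a Series indexed by (date, team), so `degree[(20180907, "SD")]` looks up one team on one day.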
Thanks for reading.