How to create multiple lists from a DataFrame column

Asked: 2019-07-16 04:43:34

Tags: list dataframe networkx

I have a DataFrame of shape (5819192, 8).

Below is a sample of the MLB game schedule with its column names. 'date', 'ym', 'yr', and 'pair' are the important variables.


   gubun      date away  hom         result    yr      ym       pair
0    MLB  20180907   SD  CIN   SD 6, CIN 12  2018  201809   [SD,CIN]
1    MLB  20180907  PIT  MIA   PIT 5, MIA 3  2018  201809  [PIT,MIA]
2    MLB  20180907  TOR  CLE   TOR 3, CLE 2  2018  201809  [TOR,CLE]
3    MLB  20180907   TB  BAL   TB 14, BAL 2  2018  201809   [TB,BAL]
4    MLB  20180907  HOU  BOS   HOU 6, BOS 3  2018  201809  [HOU,BOS]
5    MLB  20180907  PHI  NYM   PHI 4, NYM 3  2018  201809  [PHI,NYM]
6    MLB  20180907  STL  DET   STL 3, DET 5  2018  201809  [STL,DET]
7    MLB  20180907  MIN   KC   MIN 10, KC 6  2018  201809   [MIN,KC]
8    MLB  20180907  LAA  CWS   LAA 5, CWS 2  2018  201809  [LAA,CWS]
9    MLB  20180907   SF  MIL    SF 2, MIL 4  2018  201809   [SF,MIL]
10   MLB  20180907  LAD  COL   LAD 4, COL 2  2018  201809  [LAD,COL]
11   MLB  20180907  ATL  ARI   ATL 3, ARI 5  2018  201809  [ATL,ARI]
12   MLB  20180907  TEX  OAK   TEX 4, OAK 8  2018  201809  [TEX,OAK]
13   MLB  20180907  SEA  NYY   SEA 0, NYY 4  2018  201809  [SEA,NYY]
14   MLB  20190502   SD  ATL   SD 11, ATL 2  2019  201905   [SD,ATL]
15   MLB  20190502  NYM  CIN   NYM 1, CIN 0  2019  201905  [NYM,CIN]
16   MLB  20190502  MIN  HOU   MIN 8, HOU 2  2019  201905  [MIN,HOU]
17   MLB  20190502  MIL  COL  MIL 6, COL 11  2019  201905  [MIL,COL]
18   MLB  20190502   TB   KC     TB 3, KC 1  2019  201905    [TB,KC]
19   MLB  20190502  WSH  STL   WSH 2, STL 1  2019  201905  [WSH,STL]
20   MLB  20190502  CWS  BOS   CWS 6, BOS 4  2019  201905  [CWS,BOS]
21   MLB  20190502  TOR  LAA   TOR 2, LAA 6  2019  201905  [TOR,LAA]
22   MLB  20190714  WSH  PHI   WSH 3, PHI 4  2019  201907  [WSH,PHI]
23   MLB  20190714  TOR  NYY   TOR 2, NYY 4  2019  201907  [TOR,NYY]
24   MLB  20190714   TB  BAL    TB 4, BAL 1  2019  201907   [TB,BAL]
25   MLB  20190714  NYM  MIA   NYM 6, MIA 2  2019  201907  [NYM,MIA]
26   MLB  20190714  MIN  CLE   MIN 3, CLE 4  2019  201907  [MIN,CLE]
27   MLB  20190714   SF  MIL    SF 8, MIL 3  2019  201907   [SF,MIL]
28   MLB  20190714  STL  ARI   STL 5, ARI 2  2019  201907  [STL,ARI]
29   MLB  20190714   KC  DET   KC 8, DET 12  2019  201907   [KC,DET]
30   MLB  20190714  PIT  CHC   PIT 3, CHC 8  2019  201907  [PIT,CHC]
31   MLB  20190714  TEX  HOU  TEX 4, HOU 12  2019  201907  [TEX,HOU]
32   MLB  20190714  COL  CIN  COL 10, CIN 9  2019  201907  [COL,CIN]
33   MLB  20190714  OAK  CWS   OAK 3, CWS 2  2019  201907  [OAK,CWS]
34   MLB  20190714  SEA  LAA   SEA 3, LAA 6  2019  201907  [SEA,LAA]
35   MLB  20190714   SD  ATL    SD 1, ATL 4  2019  201907   [SD,ATL]
36   MLB  20190714  LAD  BOS   LAD 7, BOS 4  2019  201907  [LAD,BOS]

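I am trying to build many lists from one DataFrame column, one list per time period.

For example, if the " " column is 2019, the " " column goes into the list for 2019.

That way, I can build one list per year from the DataFrame.

This is not difficult.

Monthly lists also work fine.

However, because of the size of the data, building one list per day is not easy.

With about 5 million observations, building about 1,500 daily lists is very heavy for my code, and the processing time is very long.

So please show me the most efficient way to solve this problem.

I have attached my code.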

import pandas as pd
import networkx as nx
import numpy as np

df = pd.read_csv("C:\\example(MLB).csv", encoding="cp949")


# Sorted unique values of each period column
sort = {}
prd = ['yr', 'ym', 'date']
for col in prd:
    sort[col] = df.sort_values(by=col).drop_duplicates([col])[col]

empty = {}
period = 'date'  # Choose among (yr, ym, date)
for i in sort[period]:
    empty[i] = df[df[period].isin([i])]['pair']

# If period is 'yr' or 'ym', there is no problem.
# However, with period = 'date', it takes a very long time.
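
The loop above filters the whole frame once per distinct date, so about 1,500 dates means about 1,500 full scans of 5.8 million rows. One possible alternative is a single groupby pass. This is a minimal sketch under assumptions, not a definitive answer: the helper name edges_by_period and the three-row toy frame are hypothetical, and the edges are built from the away and hom columns because a pair column read from a CSV may arrive as a string such as "[SD,CIN]" rather than as a real pair.

```python
import pandas as pd

# Toy stand-in for the real frame; rows copied from the sample table above
df = pd.DataFrame({
    "date": [20180907, 20180907, 20190502],
    "away": ["SD", "PIT", "SD"],
    "hom": ["CIN", "MIA", "ATL"],
})

def edges_by_period(frame, period):
    """Return {period value: [(away, hom), ...]} with one groupby pass.

    Hypothetical helper: `period` is one of 'yr', 'ym', 'date'.
    """
    # One (away, hom) tuple per row, aligned with the frame's index
    edges = pd.Series(list(zip(frame["away"], frame["hom"])), index=frame.index)
    # groupby splits the rows once instead of re-filtering for each value
    return {key: grp.tolist() for key, grp in edges.groupby(frame[period])}

daily = edges_by_period(df, "date")
print(daily[20180907])
```

Each value in the resulting dict is a plain list of tuples, which is a form that add_edges_from accepts.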

FYI, the reason I need so many lists is that I want to run a graph package on them.

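frames = []
G = {}
Tu = {}
for i in empty:
    G[i] = nx.Graph()
    G[i].add_edges_from(empty[i])  # one edge per pair
    Tu[i] = nx.hits(G[i], normalized=True)  # (hubs, authorities)
    frames.append(pd.DataFrame(Tu[i]).T.assign(period=i))
# DataFrame.append was removed in pandas 2.0; concatenate once at the end
hits = pd.concat(frames)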

I have to compute the degree of each node per day, so I need the daily lists.
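
If only the daily degree is needed, it may be possible to skip the per-day lists and graphs entirely. The sketch below is one possible approach, assuming a team never plays itself and that degree means the number of distinct opponents on that day, which matches the degree in an nx.Graph built from that day's edges. The function name daily_degree is hypothetical, and the toy rows are invented so that some degrees exceed 1.

```python
import pandas as pd

# Invented toy rows (not from the real schedule) so that some degrees exceed 1
df = pd.DataFrame({
    "date": [20190714, 20190714, 20190714, 20190502],
    "away": ["TB", "TB", "NYM", "SD"],
    "hom": ["BAL", "NYM", "MIA", "ATL"],
})

def daily_degree(frame):
    """Count distinct opponents per (date, team) without building graphs."""
    # List every game once from each team's point of view
    a = frame[["date", "away", "hom"]].rename(
        columns={"away": "team", "hom": "opponent"})
    b = frame[["date", "hom", "away"]].rename(
        columns={"hom": "team", "away": "opponent"})
    # Repeated match-ups on one day collapse, as they do in a simple graph
    both = pd.concat([a, b], ignore_index=True).drop_duplicates()
    return both.groupby(["date", "team"]).size().rename("degree")

deg = daily_degree(df)
print(deg)
```

The result is a Series indexed by (date, team), so deg.loc[(20190714, "TB")] looks up a single team on a single day.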

Thanks for reading.

0 answers:

No answers