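Breaking a large dataset into an organized index

Time: 2014-12-20 00:35:04

Tags: python csv dictionary pandas

I'm trying to create an indexed dictionary of shape_id values from a dataset I have (see below). I realize I could use loops (and I have tried), but I have a hunch that pandas offers ways to do this that are not computationally expensive.

Possible solutions: groupby, str.findall, str.extract

The dictionary should be structured as follows: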

{shape_id: [shape_pt_sequence, [shape_pt_lat,shape_pt_lon]]}

Here is the code I have so far:

import pandas as pd

# readability assignments for shapes.csv
shapes = pd.read_csv('csv/shapes.csv')
shapes_shape_id = shapes['shape_id']
shapes_shape_id_index = list(set(shapes_shape_id))
shapes_shape_pt_sequence = shapes['shape_pt_sequence']
shapes_shape_pt_lat = shapes['shape_pt_lat']
shapes_shape_pt_lon = shapes['shape_pt_lon']

shapes_tuple = []

# add shape index to final dict
for i in range(len(shapes_shape_id_index)):
    shapes_tuple.append([shapes_shape_id_index[i]])

print(shapes_tuple)

Here is a LINK to a gist of shapes.csv

This is the empty shape_id index:

[[20992], [20993], [20994], [20995], [20996], [20997], [20998], [20999], [21000], [21001], [21002], [21003], [21004], [21005], [21006], [21007], [21008], [21009], [21010], [21011], [21012], [21013], [21014], [21015], [21016], [21017], [21018], [21019], [21020], [21021], [21022], [21023], [21026], [21027], [21028], [21029], [21030], [21031], [21032], [21033], [21034], [21035], [21036], [21037], [21038], [21039], [21040], [21041], [21042], [21043], [21044], [21045], [21046], [21047], [21048], [21049], [21050], [21051], [21052], [21053], [21054], [21055], [21056], [21057], [21058], [21059], [21060], [21061], [21062], [21063], [21064], [21065], [21066], [21067], [21068], [21069], [21070], [21071], [21072], [21073], [21074], [21075], [21076], [21077], [21078], [21079], [21080], [21081], [21082], [21083], [21084], [21085], [21086], [21087], [21088], [21089], [20958], [20959], [20960], [20961], [20962], [20963], [20964], [20965], [20966], [20967], [20968], [20969], [20970], [20971], [20972], [20973], [20974], [20975], [20976], [20977], [20978], [20979], [20980], [20981], [20982], [20983], [20984], [20985], [20986], [20987], [20988], [20989], [20990], [20991]]

shapes.csv looks like this:

shape_id,shape_pt_lat,shape_pt_lon,shape_pt_sequence,is_stop
20958,44.0577683,-123.0873313,1,0
20958,44.0577163,-123.087073,2,0
20958,44.0576286,-123.0867103,3,0
20958,44.0574258,-123.086641,4,0
20958,44.0571421,-123.0866518,5,0
20958,44.0568706,-123.086653,6,0
20958,44.0566161,-123.0867028,7,0
20958,44.0565641,-123.0869733,8,0
20958,44.0565503,-123.0872603,9,0
20958,44.0565536,-123.087631,10,0
20958,44.0565439,-123.0879283,11,0
20958,44.0564661,-123.087894,12,0
20958,44.0565124,-123.0881793,13,0
20958,44.0565181,-123.0884921,14,0
20958,44.0565331,-123.0888668,15,0
20958,44.0565406,-123.0892323,16,0
20958,44.0565406,-123.0896295,17,0
20958,44.0563515,-123.0897096,18,0
20958,44.056073,-123.0897108,19,0
20958,44.0558501,-123.0897,20,0
20958,44.0558358,-123.0897016,21,0
20958,44.0556489,-123.0896861,22,0
20958,44.0554398,-123.0896781,23,0
20958,44.0552033,-123.0896776,24,0
20958,44.0549253,-123.089692,25,0
20958,44.0546778,-123.0897281,26,0
20958,44.0546578,-123.0897326,27,0
20958,44.0546338,-123.0896965,28,0
20958,44.0543988,-123.0896838,29,0
20958,44.0543536,-123.0899543,30,0
20958,44.0543628,-123.0903496,31,0
20958,44.0543668,-123.0906733,32,0
20958,44.0543718,-123.0910178,33,0

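For example, in shapes.csv the maximum shape_pt_sequence value for 20958 is 72, the maximum for 20960 is 400, and so on.

2 Answers:

Answer 0: (score: 0)

I don't know why you need a result like [shape_id:[shape_pt_sequence, [shape_pt_lat,shape_pt_lon]]]; it is not very useful for selecting data. You can use a MultiIndex instead: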

shapes = pd.read_csv('shapes.csv')
shapes.set_index(["shape_id", "shape_pt_sequence"], inplace=True)

Then select all the data for 20958:

print(shapes.loc[20958])

Select a single point:

print(shapes.loc[(20958, 45), :])

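Select the data for 20958 with shape_pt_sequence in a range:

# if this raises UnsortedIndexError, call shapes.sort_index(inplace=True) first
print(shapes.loc[(20958, slice(45, 48)), :])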

Select the data with shape_pt_sequence in [45, 48]:

print(shapes.loc[(20958, [45, 48]), :])
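
The selections above can be tried on a small in-memory frame. This is a minimal sketch, not part of the original answer: the three rows for 20958 are copied from the sample in the question, while the single 20959 row is made up for illustration. The index is sorted up front, on the assumption that the real file may not already be ordered, since slicing on a MultiIndex level needs a sorted index.

```python
import io
import pandas as pd

# Rows for 20958 come from the question's sample; the 20959 row is hypothetical.
csv_text = """shape_id,shape_pt_lat,shape_pt_lon,shape_pt_sequence,is_stop
20958,44.0577683,-123.0873313,1,0
20958,44.0577163,-123.087073,2,0
20958,44.0576286,-123.0867103,3,0
20959,44.0500000,-123.0900000,1,0
"""

shapes = pd.read_csv(io.StringIO(csv_text))
shapes = shapes.set_index(["shape_id", "shape_pt_sequence"]).sort_index()

all_points = shapes.loc[20958]                  # every row of one shape
one_point = shapes.loc[(20958, 2), :]           # a single point, returned as a Series
in_range = shapes.loc[(20958, slice(2, 3)), :]  # sequence 2 through 3, both ends included
picked = shapes.loc[(20958, [1, 3]), :]         # only sequence values 1 and 3

print(len(all_points), len(in_range), len(picked))
```

Note that label-based slices in pandas include both endpoints, unlike ordinary Python slices.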

If you really want that form, here is the code:

shapes = pd.read_csv('shapes.csv')

def f(df):
    return [df.shape_pt_sequence.tolist(), [df.shape_pt_lat.tolist(), df.shape_pt_lon.tolist()]]

res = shapes.groupby("shape_id").apply(f).to_dict()
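
The code above stores three parallel lists per shape_id. If the structure in the question is read instead as one [shape_pt_sequence, [shape_pt_lat, shape_pt_lon]] entry per point, a variant could look like the sketch below. That reading is an assumption, as are the inline sample (first rows from the question, plus one hypothetical 20959 row) and the name `res`, which is reused here only for continuity.

```python
import io
import pandas as pd

# Rows for 20958 come from the question's sample; the 20959 row is hypothetical.
csv_text = """shape_id,shape_pt_lat,shape_pt_lon,shape_pt_sequence,is_stop
20958,44.0577683,-123.0873313,1,0
20958,44.0577163,-123.087073,2,0
20958,44.0576286,-123.0867103,3,0
20959,44.0500000,-123.0900000,1,0
"""

shapes = pd.read_csv(io.StringIO(csv_text))
shapes = shapes.sort_values(["shape_id", "shape_pt_sequence"])

res = {}
for shape_id, group in shapes.groupby("shape_id"):
    # One [sequence, [lat, lon]] entry per point, in sequence order.
    res[int(shape_id)] = [
        [int(seq), [float(lat), float(lon)]]
        for seq, lat, lon in zip(
            group["shape_pt_sequence"], group["shape_pt_lat"], group["shape_pt_lon"]
        )
    ]

print(res[20958][0])
```

Converting to plain int and float keeps NumPy scalar types out of the resulting dictionary, which helps if it is later serialized to JSON.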

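Answer 1: (score: 0)

Assuming your REAL task is not validating the data file, reading the file and filling an appropriate data structure with a loop is not that clunky, not at all...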

with open('shapes.csv') as f:  # the with block closes the file
    next(f)  # skip the header row
    lines = [line.strip().split(',') for line in f]

data = {}
item = None
for i, lat, lon, seq, stop in lines:
    i = int(i)
    if i != item:
        # new shape_id: start a new list (assumes rows are grouped by shape_id)
        item = i
        data[item] = [(float(lat), float(lon))]
    else:
        data[item].append((float(lat), float(lon)))

There is no need for a stop sentinel in your data file, and no need to explicitly store an index for each coordinate pair.
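
The loop above relies on all rows for one shape_id being contiguous and already in sequence order. If that cannot be guaranteed, a standard-library variant could group by key and sort before dropping the sequence number. This is a sketch under that assumption; the inline sample reuses rows from the question, deliberately out of order, and the names `data` and `points` are illustrative.

```python
import csv
import io
from collections import defaultdict

# Rows copied from the question's sample, shuffled to show that order is restored.
csv_text = """shape_id,shape_pt_lat,shape_pt_lon,shape_pt_sequence,is_stop
20958,44.0577163,-123.087073,2,0
20958,44.0577683,-123.0873313,1,0
20958,44.0576286,-123.0867103,3,0
"""

data = defaultdict(list)
for row in csv.DictReader(io.StringIO(csv_text)):
    data[int(row["shape_id"])].append(
        (int(row["shape_pt_sequence"]), float(row["shape_pt_lat"]), float(row["shape_pt_lon"]))
    )

# Sort each shape's points by sequence, then drop the sequence number:
# the position in the list now encodes the order.
points = {
    shape_id: [(lat, lon) for _, lat, lon in sorted(rows)]
    for shape_id, rows in data.items()
}

print(points[20958][0])
```

With a real file, `io.StringIO(csv_text)` would be replaced by a file object opened with `newline=''`, as the csv module documentation recommends.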