I'm a Python novice trying to preprocess time-series data so that I can compute changes as an object moves over a series of nodes and edges, count the stops, aggregate them into routes, and understand behavior over each route. The data originally comes as two CSV files (entrance, Typedoc = 0, and clearance, Typedoc = 1; each about 85k rows / 19MB) that I merged into one file and performed some dimensionality reduction on. I've managed to get it into a multi-index dataframe. Here's a snippet:
In [1]: movements.head()
Out[1]:
                    Typedoc  Port   NRT   GRT      Draft
Vessname ECDate
400 L    2012-01-19       0  2394  2328  7762   4.166667
         2012-07-22       1  2394  2328  7762  17.000000
         2012-10-29       0  2395  2328  7762   6.000000
A 397    2012-05-27       1  3315  2928  2928  18.833333
         2012-06-01       0  3315  2928  2928   5.250000
I'm interested in understanding the changes for each index level (each Vessname) as it traverses its timeseries. I'm eventually going to represent this as a graph. I think I'd really like this data in dictionary form, where each entry for a unique Vessname is essentially a tokenized list of the stops along its route:
stops_dict = {'400 L':[
['2012-01-19', 0, 2394, 4.166667],
['2012-07-22', 1, 2394, 17.000000],
['2012-10-29', 0, 2395, 6.000000]
]
}
Where the nested list values are:
[ECDate, Typedoc, Port, Draft]
Starting from i = 0, the values I'm interested in are the Dwell and Transit times and the Draft Change, calculated as:
t_dwell = stops_dict['400 L'][i+1][0] - stops_dict['400 L'][i][0]
d_draft = stops_dict['400 L'][i+1][3] - stops_dict['400 L'][i][3]
i += 1
and
t_transit = stops_dict['400 L'][i+1][0] - stops_dict['400 L'][i][0]
assuming all of the dtypes are correct (a big if, since I have not yet mastered getting pandas to parse my dates). I'm then going to extract the links in some form like:
link = str(stops_dict['400 L'][i][2])+'->'+str(stops_dict['400 L'][i+1][2]),t_transit,d_draft
with the t_transit and d_draft values as edge weights. The nodes are the list of unique Port values, each of which gets assigned the '400 L': [t_dwell, NRT, GRT] key/value pairs (somehow). I haven't figured that part out exactly, but I don't think I need help with it.
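For concreteness, here is roughly how I picture wiring those pieces together once stops_dict exists. This is only a sketch: it assumes the ECDate strings get parsed with pd.to_datetime (which is also how I hope to solve my date-parsing problem) and that each vessel's rows alternate entrance (Typedoc = 0) and clearance (Typedoc = 1):
import pandas as pd

route = stops_dict['400 L']

# parse the ECDate strings once so the subtractions below yield Timedeltas
for stop in route:
    stop[0] = pd.to_datetime(stop[0])

links = []
for i in range(len(route) - 1):
    t_delta = route[i + 1][0] - route[i][0]
    d_draft = route[i + 1][3] - route[i][3]
    if route[i][1] == 0:
        t_dwell = t_delta        # entrance -> clearance: time spent at the Port
    else:
        t_transit = t_delta      # clearance -> next entrance: time moving between Ports
        links.append((str(route[i][2]) + '->' + str(route[i + 1][2]), t_transit, d_draft))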
I couldn't figure out a simpler way, so I tried defining a function, which meant starting over by writing my sorted dataframe out to CSV and reading it back in using:
import csv

with open(filename, 'r', newline='') as csvfile:
    datareader = csv.reader(csvfile, delimiter=",")
    next(datareader, None)  # skip the header row
    <FLOW CONTROL>  # based on Typedoc and ECDate values
The function adds to an empty dictionary:
stops_dict = {}

def createStopsDict(row):
    # reads each row in a csv file and either creates a dict entry
    # keyed on row[0] (Vessname) if it isn't in the dict yet,
    # or appends everything after row[0] to the existing entry
    ves = row[0]
    if ves in stops_dict:
        stops_dict[ves].append(row[1:])
    else:
        stops_dict[ves] = [row[1:]]
    return
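For completeness, it would be called once per row inside the reader loop, something like this (still a sketch; the Typedoc/ECDate flow control is the part I haven't settled on):
with open(filename, 'r', newline='') as csvfile:
    datareader = csv.reader(csvfile, delimiter=",")
    next(datareader, None)  # skip the header row
    for row in datareader:
        # flow control based on Typedoc and ECDate would go here
        createStopsDict(row)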
This is an inefficient way of doing things... I could possibly be using iterrows instead of a csv reader... I've looked into melt and unstack and I don't think those are correct... This seems essentially like a groupby effort, but I haven't managed to implement that correctly because of the multi-index...
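The kind of thing I was imagining, but couldn't get to work against the multi-index, is something like this (untested sketch; it resets the index first and picks the columns to match the nested-list layout above):
stops_dict = (
    movements.reset_index()
             .groupby('Vessname')[['ECDate', 'Typedoc', 'Port', 'Draft']]
             .apply(lambda g: g.values.tolist())
             .to_dict()
)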
Is there a simpler, dare I say 'elegant', way to map the dataframe rows, based on the multi-index values, directly into a reusable data structure (right now the dictionary stops_dict)?
I'm not tied to the dictionary or its structure, so if there's a better way I am open to suggestions.
Thanks!
UPDATE 2: I think I have this mostly figured out... Beginning with my original data frame movements:
movements.reset_index().apply(
    lambda x: makeRoute(x.Vessname,
                        [x.ECDate,
                         x.Typedoc,
                         x.Port,
                         x.NRT,
                         x.GRT,
                         x.Draft]),
    axis=1
)
where:
routemap = {}

def makeRoute(Vessname, info):
    if Vessname in routemap:
        route = routemap[Vessname]
        route.append(info)
    else:
        routemap[Vessname] = [info]
    return
this fills routemap with a dictionary keyed on Vessname, in the structure I need to compute things by indexing into the nested lists.
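One further cleanup I might make: makeRoute can be collapsed with dict.setdefault, since apply is only being used here for its side effect of populating routemap (the column of None values it returns gets ignored):
def makeRoute(Vessname, info):
    # create the list on first encounter of a vessel, then append to it
    routemap.setdefault(Vessname, []).append(info)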