Pandas CSV: find dates between a start date and an end date

Time: 2015-12-12 16:43:45

Tags: python csv pandas

ID    ArCityArCountry         DptCityDptCountry      DateDpt    DateAr
1922  ParisFrance             NewYorkUnitedState     2008-03-10 2001-02-02
1002  LosAngelesUnitedState   California UnitedState 2008-03-10 2008-12-01
1901  ParisFrance             LagosNigeria           2001-03-05 2001-02-02
1922  ParisFrance             NewYorkUnitedState     2011-02-03 2008-12-01
1002  ParisFrance             CaliforniaUnitedState  2003-03-04 2002-03-04
1099  ParisFrance             BeijingChina           2011-02-03 2009-02-04
1901  LosAngelesUnitedState   ParisFrance            2001-03-05 2001-02-02

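I want to group the rows by ArCityArCountry (ParisFrance, LosAngelesUnitedState), then by DptCityDptCountry in the same way, and then take the dates into account (i.e. DateAr and DateDpt).

For example: ParisFrance [list the ID, DptCityDptCountry, DateDpt and DateAr of every row related to ParisFrance, without writing ParisFrance again on each row], then LosAngelesUnitedState [list the ID, DptCityDptCountry, DateDpt and DateAr of every row related to LosAngelesUnitedState, again without repeating LosAngelesUnitedState].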

import csv
import pandas as pd

# Merge the separate city and country columns of the raw file into single columns.
with open("testfile.csv", newline="") as src:
    rows = [[row[0], row[1] + row[2], row[3] + row[4], row[5], row[6]]
            for row in csv.reader(src)]
print(rows)

# Write the merged rows back out so pandas can read them.
with open("data.csv", "w", newline="") as dst:
    csv.writer(dst).writerows(rows)

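df = pd.read_csv('data.csv')
# Parse both date columns once; no loop over the columns is needed.
df['DateDpt'] = pd.to_datetime(df['DateDpt'], format='%Y-%m-%d')
df['DateAr'] = pd.to_datetime(df['DateAr'], format='%Y-%m-%d')
print(df)

# Keep rows whose DateAr is not after DateDpt, sort them (one ascending
# flag per sort key), then take the first row of each city pair.
result = (
    df[df.DateAr <= df.DateDpt]
    .sort_values(['ID', 'DateAr', 'DateDpt'], ascending=[True, True, True])
    .groupby(['DptCityDptCountry', 'ArCityArCountry'])
    .first()
    .reset_index()
)
print(result)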

Expected output:

ParisFrance 
  [1922, NewYorkUnitedState, 2008-03-10, 2001-02-02], [1901, LagosNigeria, 2001-03-05, 2001-02-02], [1922, NewYorkUnitedState, 2011-02-03, 2008-12-01]

LosAngelesUnitedState
  [1901,ParisFrance,2001-03-05, 2001-02-02]

1 answer:

Answer 0 (score: 0)

It sounds like you are looking for something like this:

df['DateAr'] = pd.to_datetime(df['DateAr'])
df['DateDpt'] = pd.to_datetime(df['DateDpt'])

dept_cities = df.groupby('ArCityArCountry')

for city, departures in dept_cities:
    print(city)
    print([list(r) for r in departures.loc[:, ['ID', 'DptCityDptCountry', 'DateDpt', 'DateAr']].to_records()])

This gets you close to the format you specified; the print() call can of course be tweaked further:

LosAngelesUnitedState
[[1, 1002, 'California UnitedState', numpy.datetime64('2008-03-09T18:00:00.000000000-0600'), numpy.datetime64('2008-11-30T18:00:00.000000000-0600')], [6, 1901, 'ParisFrance', numpy.datetime64('2001-03-04T18:00:00.000000000-0600'), numpy.datetime64('2001-02-01T18:00:00.000000000-0600')]]
ParisFrance
[[0, 1922, 'NewYorkUnitedState', numpy.datetime64('2008-03-09T18:00:00.000000000-0600'), numpy.datetime64('2001-02-01T18:00:00.000000000-0600')], [2, 1901, 'LagosNigeria', numpy.datetime64('2001-03-04T18:00:00.000000000-0600'), numpy.datetime64('2001-02-01T18:00:00.000000000-0600')], [3, 1922, 'NewYorkUnitedState', numpy.datetime64('2011-02-02T18:00:00.000000000-0600'), numpy.datetime64('2008-11-30T18:00:00.000000000-0600')], [4, 1002, 'CaliforniaUnitedState', numpy.datetime64('2003-03-03T18:00:00.000000000-0600'), numpy.datetime64('2002-03-03T18:00:00.000000000-0600')], [5, 1099, 'BeijingChina', numpy.datetime64('2011-02-02T18:00:00.000000000-0600'), numpy.datetime64('2009-02-03T18:00:00.000000000-0600')]]
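
The output above still carries the row index and raw numpy.datetime64 values. A minimal sketch of one way to tidy it follows, assuming the sample rows from the question. The helper name rows_by_arrival_city and its require_arrival_before_departure flag are hypothetical, and the optional DateAr <= DateDpt filter is only a guess at the intent behind the question's own filter.

```python
import pandas as pd

# Sample rows copied from the table in the question.
df = pd.DataFrame({
    "ID": [1922, 1002, 1901, 1922, 1002, 1099, 1901],
    "ArCityArCountry": [
        "ParisFrance", "LosAngelesUnitedState", "ParisFrance", "ParisFrance",
        "ParisFrance", "ParisFrance", "LosAngelesUnitedState",
    ],
    "DptCityDptCountry": [
        "NewYorkUnitedState", "California UnitedState", "LagosNigeria",
        "NewYorkUnitedState", "CaliforniaUnitedState", "BeijingChina",
        "ParisFrance",
    ],
    "DateDpt": ["2008-03-10", "2008-03-10", "2001-03-05", "2011-02-03",
                "2003-03-04", "2011-02-03", "2001-03-05"],
    "DateAr": ["2001-02-02", "2008-12-01", "2001-02-02", "2008-12-01",
               "2002-03-04", "2009-02-04", "2001-02-02"],
})
for col in ("DateDpt", "DateAr"):
    df[col] = pd.to_datetime(df[col], format="%Y-%m-%d")


def rows_by_arrival_city(frame, require_arrival_before_departure=False):
    """Hypothetical helper: map each arrival city to its rows as plain lists."""
    out = frame.copy()
    if require_arrival_before_departure:
        # Assumed intent: keep only rows whose DateAr is not after DateDpt.
        out = out[out["DateAr"] <= out["DateDpt"]]
    # Render the dates as plain strings instead of numpy.datetime64 values.
    for col in ("DateDpt", "DateAr"):
        out[col] = out[col].dt.strftime("%Y-%m-%d")
    cols = ["ID", "DptCityDptCountry", "DateDpt", "DateAr"]
    # Selecting only these columns leaves the row index out of the lists.
    return {city: group[cols].values.tolist()
            for city, group in out.groupby("ArCityArCountry")}


grouped = rows_by_arrival_city(df)
filtered = rows_by_arrival_city(df, require_arrival_before_departure=True)
for city, rows in filtered.items():
    print(city)
    print(rows)
```

With the filter switched on, the LosAngelesUnitedState group keeps only the 1901 row, which matches the expected output in the question. The ParisFrance group still contains five rows, so the filter alone does not reproduce the expected ParisFrance list.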