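This is my first time using Python (I have used R before), so please bear with me. Basically, I want to use a for loop to compare the datetime value in each row of a pandas DataFrame with the datetime values in all other rows and, if the time difference is 4 hours or less, store those rows in a subset object df for later processing. However, I'm not sure where to start.
Let's assume this is my dataset: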
   Origin       Destination  Time
0  New York     Cairo        2016-03-28 02:00:00
1  New York     Los Angeles  2016-03-28 04:00:00
2  Boston       Hawaii       2016-03-28 06:00:00
3  New York     Boston       2016-03-28 08:00:00
4  Los Angeles  Boston       2016-03-28 10:00:00
5  Los Angeles  Hawaii       2016-03-28 12:00:00
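For anyone who wants to reproduce this, the table can be built directly as a DataFrame. This is only a minimal sketch: the values are copied from the table, while the variable name `flights` is an illustrative assumption and not part of the original data.

```python
import pandas as pd

# Sample data copied from the table in the question.
# The name "flights" is a hypothetical choice for illustration.
flights = pd.DataFrame({
    "Origin": ["New York", "New York", "Boston",
               "New York", "Los Angeles", "Los Angeles"],
    "Destination": ["Cairo", "Los Angeles", "Hawaii",
                    "Boston", "Boston", "Hawaii"],
    # Parse the strings so that Time is a real datetime column
    "Time": pd.to_datetime([
        "2016-03-28 02:00:00", "2016-03-28 04:00:00", "2016-03-28 06:00:00",
        "2016-03-28 08:00:00", "2016-03-28 10:00:00", "2016-03-28 12:00:00",
    ]),
})

print(flights)
print(flights.dtypes)
```

Having `Time` as a datetime column matters because subtracting two such columns yields timedeltas that can be compared against `pd.Timedelta`.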
This is what the result should look like:
>>>df[0]
   Origin    Destination  Time
0  New York  Cairo        2016-03-28 02:00:00
>>>df[1]
   Origin    Destination  Time
0  New York  Cairo        2016-03-28 02:00:00
1  New York  Los Angeles  2016-03-28 04:00:00
>>>df[2]
   Origin    Destination  Time
0  New York  Cairo        2016-03-28 02:00:00
1  New York  Los Angeles  2016-03-28 04:00:00
2  Boston    Hawaii       2016-03-28 06:00:00
>>>df[3]
   Origin    Destination  Time
1  New York  Los Angeles  2016-03-28 04:00:00
2  Boston    Hawaii       2016-03-28 06:00:00
3  New York  Boston       2016-03-28 08:00:00
>>>df[5]
   Origin       Destination  Time
3  New York     Boston       2016-03-28 08:00:00
4  Los Angeles  Boston       2016-03-28 10:00:00
5  Los Angeles  Hawaii       2016-03-28 12:00:00
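I can't figure out how to do this.
Answer 0 (score: 4)
If you want a pure pandas solution without any loops, you can cross-join the DataFrame with itself and then filter on the time difference. Here is an example: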
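import pandas as pd

# Load the tab-separated file and parse the Time column as datetimes
data = pd.read_csv("abc.csv", delimiter="\t")
data["Time"] = pd.to_datetime(data["Time"])
# Constant key: merging on it pairs every row with every row
data["Ignore"] = 1
# Keep the original row number as a regular column named "index"
data = data.reset_index()
# Cross-join (overlapping columns get the suffixes _x and _y)
merged = pd.merge(data, data, how="outer", on="Ignore")
# This is the magic: keep only pairs that are less than 4 hours apart
# (use <= instead of < if exactly 4 hours should also count)
merged = merged[(merged["Time_x"] - merged["Time_y"]).abs() < pd.Timedelta("4 hours")]
# Group by the left-hand row so you have some structure
groups = merged.groupby("index_x").apply(lambda x: x.set_index("index_y")[["Origin_y", "Destination_y", "Time_y"]])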
This gives you a result like this:
                 Origin_y  Destination_y  Time_y
index_x index_y
0       0        New York  Cairo          2016-03-28 02:00:00
        1        New York  Los Angeles    2016-03-28 04:00:00
1       0        New York  Cairo          2016-03-28 02:00:00
        1        New York  Los Angeles    2016-03-28 04:00:00
        2        Boston    Hawaii         2016-03-28 06:00:00
2       1        New York  Los Angeles    2016-03-28 04:00:00
        2        Boston    Hawaii         2016-03-28 06:00:00
        3        New York  Boston         2016-03-28 08:00:00
3       2        Boston    Hawaii         2016-03-28 06:00:00
        3        New York  Boston         2016-03-28 08:00:00
...
You can access the group for an individual row like this:
> groups.T[0].T
         Origin_y  Destination_y  Time_y
index_y
0        New York  Cairo          2016-03-28 02:00:00
1        New York  Los Angeles    2016-03-28 04:00:00
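On pandas 1.2 or later, the dummy `Ignore` column should not be necessary, because `merge` accepts `how="cross"`. The following is a minimal sketch under that assumption. It rebuilds the question's sample data inline instead of reading `abc.csv`, and the names `pairs`, `close` and `neighbours` are illustrative only.

```python
import pandas as pd

# Sample data from the question, built inline (assumption: no CSV file at hand)
data = pd.DataFrame({
    "Origin": ["New York", "New York", "Boston",
               "New York", "Los Angeles", "Los Angeles"],
    "Destination": ["Cairo", "Los Angeles", "Hawaii",
                    "Boston", "Boston", "Hawaii"],
    "Time": pd.to_datetime([
        "2016-03-28 02:00:00", "2016-03-28 04:00:00", "2016-03-28 06:00:00",
        "2016-03-28 08:00:00", "2016-03-28 10:00:00", "2016-03-28 12:00:00",
    ]),
}).reset_index()  # keep the original row number as column "index"

# Cross-join without a helper key (requires pandas >= 1.2)
pairs = data.merge(data, how="cross")

# Keep only the pairs that are strictly less than 4 hours apart
window = pd.Timedelta("4 hours")
close = pairs[(pairs["Time_x"] - pairs["Time_y"]).abs() < window]

# Map each row number to the row numbers that fall inside its window
neighbours = close.groupby("index_x")["index_y"].apply(list).to_dict()
print(neighbours)
```

Note that a cross-join builds one pair for every combination of two rows, so memory use grows quadratically with the number of rows. That is fine for a small table like this one, but may be a concern for large ones.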
Answer 1 (score: 2)
Starting with this:
   Origin       Destination  Time
0  New York     Cairo        2016-03-28 00:00:00
1  New York     Los Angeles  2016-03-28 02:00:00
2  Boston       Hawaii       2016-03-28 04:00:00
3  New York     Boston       2016-03-28 06:00:00
4  Los Angeles  Boston       2016-03-28 08:00:00
5  Los Angeles  Hawaii       2016-03-28 10:00:00
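Use a dict to store your DataFrames, then use the row index as the key to access each DataFrame in the dict.
NewDict = {}
for i, e in df.iterrows():
    # Rows whose Time is strictly within 4 hours before or after this row's Time
    NewDict[i] = df[(df['Time'] > e['Time'] - pd.Timedelta('4 hours')) & (df['Time'] < e['Time'] + pd.Timedelta('4 hours'))]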
NewDict[0]
   Origin    Destination  Time
0  New York  Cairo        2016-03-28 00:00:00
1  New York  Los Angeles  2016-03-28 02:00:00
NewDict[4]
   Origin       Destination  Time
3  New York     Boston       2016-03-28 06:00:00
4  Los Angeles  Boston       2016-03-28 08:00:00
5  Los Angeles  Hawaii       2016-03-28 10:00:00
To get the counts (Python 3 syntax):
for k, v in NewDict.items():
    print("Key", k, "has", len(v), "items")
Key 0 has 2 items
Key 1 has 3 items
Key 2 has 3 items
Key 3 has 3 items
Key 4 has 3 items
Key 5 has 2 items
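The loop in this answer uses strict comparisons, so rows that are exactly 4 hours apart are left out, whereas the question asks for "4 hours or less". If the boundary should be included, one option is `Series.between`, which is inclusive on both ends by default. This is a minimal sketch on this answer's sample data; the names `window`, `inclusive` and `counts` are illustrative assumptions.

```python
import pandas as pd

# Sample data from this answer, built inline
df = pd.DataFrame({
    "Origin": ["New York", "New York", "Boston",
               "New York", "Los Angeles", "Los Angeles"],
    "Destination": ["Cairo", "Los Angeles", "Hawaii",
                    "Boston", "Boston", "Hawaii"],
    "Time": pd.to_datetime([
        "2016-03-28 00:00:00", "2016-03-28 02:00:00", "2016-03-28 04:00:00",
        "2016-03-28 06:00:00", "2016-03-28 08:00:00", "2016-03-28 10:00:00",
    ]),
})

window = pd.Timedelta(hours=4)

# between() includes both endpoints by default, so a gap of exactly 4 hours counts
inclusive = {
    i: df[df["Time"].between(e["Time"] - window, e["Time"] + window)]
    for i, e in df.iterrows()
}

# Number of rows stored under each key
counts = {i: len(v) for i, v in inclusive.items()}
print(counts)
```

With the boundary included, each subset grows by the rows sitting exactly 4 hours away, so the counts are larger than the ones printed above.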
Edit, to loop in reverse order:
reverse = df.reindex(index=df.index[::-1])
revSorted = {}
for i, e in reverse.iterrows():
    revSorted[i] = reverse[(reverse['Time'] > e['Time'] - pd.Timedelta('4 hours')) & (reverse['Time'] < e['Time'] + pd.Timedelta('4 hours'))]
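Answer 2 (score: 1)
The logic of the loop is:
from datetime import timedelta

# rows is assumed to be a list of dict-like rows with a 'Time' key, sorted by time
df = []
for i, row in enumerate(rows):
    # Each group starts with the row itself
    df.append([row])
    # Slicing past the end just yields an empty list, so no IndexError handling is needed
    for next_row in rows[i + 1:]:
        if abs(row['Time'] - next_row['Time']) < timedelta(hours=4):
            df[i].append(next_row)
        else:
            # Rows are sorted, so every later row is even further away
            break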
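To try this loop on the question's sample data, the rows can be taken from a DataFrame with `to_dict("records")`. The sketch below makes that assumption; the names `flights` and `groups` are illustrative, and the rows are sorted by `Time` first because the early `break` relies on that order. Note that this logic only looks forward from each row, so it does not collect earlier rows.

```python
from datetime import timedelta

import pandas as pd

# Sample data from the question (the name "flights" is hypothetical)
flights = pd.DataFrame({
    "Origin": ["New York", "New York", "Boston",
               "New York", "Los Angeles", "Los Angeles"],
    "Destination": ["Cairo", "Los Angeles", "Hawaii",
                    "Boston", "Boston", "Hawaii"],
    "Time": pd.to_datetime([
        "2016-03-28 02:00:00", "2016-03-28 04:00:00", "2016-03-28 06:00:00",
        "2016-03-28 08:00:00", "2016-03-28 10:00:00", "2016-03-28 12:00:00",
    ]),
})

# One dict per row, in chronological order
rows = flights.sort_values("Time").to_dict("records")

groups = []
for i, row in enumerate(rows):
    group = [row]  # every group starts with the row itself
    for next_row in rows[i + 1:]:
        if abs(row["Time"] - next_row["Time"]) < timedelta(hours=4):
            group.append(next_row)
        else:
            break  # sorted input: later rows are further away
    # Turn the collected rows back into a small DataFrame
    groups.append(pd.DataFrame(group))

print([len(g) for g in groups])
```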