我有一个字典,其中包含具有相同列名的pandas数据框,我希望删除具有相同值和行ID的重复数据框。
我们假设这是我的数据框词典:
>>> dd[0]
Origin Destination Time
0 New York Boston 2016-03-28 02:00:00
1 New York Los Angeles 2016-03-28 04:00:00
2 Boston Los Angeles 2016-03-28 06:00:00
>>> dd[1]
Origin Destination Time
0 New York Boston 2016-03-28 02:00:00
1 New York Los Angeles 2016-03-28 04:00:00
2 Boston Los Angeles 2016-03-28 06:00:00
>>> dd[2]
Origin Destination Time
0 New York Boston 2016-03-28 02:00:00
1 New York Los Angeles 2016-03-28 04:00:00
2 Boston Los Angeles 2016-03-28 06:00:00
>>> dd[3]
Origin Destination Time
1 New York Los Angeles 2016-03-28 04:00:00
2 Los Angeles Boston 2016-03-28 06:00:00
3 Boston New York 2016-03-28 08:00:00
>>> dd[4]
Origin Destination Time
1 New York Los Angeles 2016-03-28 04:00:00
2 Los Angeles Boston 2016-03-28 06:00:00
3 Boston New York 2016-03-28 08:00:00
>>> dd[5]
Origin Destination Time
3 Boston New York 2016-03-28 08:00:00
4 New York Los Angeles 2016-03-28 12:00:00
>>> dd[6]
Origin Destination Time
3 Boston New York 2016-03-28 08:00:00
4 New York Los Angeles 2016-03-28 12:00:00
我希望结果看起来像这样:
>>> dd[0]
Origin Destination Time
0 New York Boston 2016-03-28 02:00:00
1 New York Los Angeles 2016-03-28 04:00:00
2 Boston Los Angeles 2016-03-28 06:00:00
>>> dd[3]
Origin Destination Time
1 New York Los Angeles 2016-03-28 04:00:00
2 Los Angeles Boston 2016-03-28 06:00:00
3 Boston New York 2016-03-28 08:00:00
>>> dd[5]
Origin Destination Time
3 Boston New York 2016-03-28 08:00:00
4 New York Los Angeles 2016-03-28 12:00:00
这是我的代码,导致上述例子:
# Load data as pandas data frame
data = pd.read_csv("website.txt", names = ["Time", "Origin", `"Destination"])`
data["Time"] = pd.to_datetime(data["Time"], infer_datetime_format=True)
# Reverse data frame by index to loop backwards
data = data.reindex(index=df.index[::-1])
dd = {}
for i, e in reverse.iterrows():
dd[i] = data[ (data['Time'] > e['Time']-pd.Timedelta('4 hours')) & (data['Time'] < e['Time'] + pd.Timedelta('4 hours'))]
原文:
{"Time": "2016-03-28T02:00:00Z", "Origin": "New York", "Destination": "Boston"}
{"Time": "2016-03-28T02:00:00Z", "Origin": "New York", "Destination": "Boston"}
{"Time": "2016-03-28T02:00:00Z", "Origin": "New York", "Destination": "Boston"}
{"Time": "2016-03-28T04:00:00Z", "Origin": "New York", "Destination": "Los Angeles"}
{"Time": "2016-03-28T04:00:00Z", "Origin": "New York", "Destination": "Los Angeles"}
{"Time": "2016-03-28T04:00:00Z", "Origin": "New York", "Destination": "Los Angeles"}
{"Time": "2016-03-28T06:00:00Z", "Origin": "Boston", "Destination": "Los Angeles"}
{"Time": "2016-03-28T06:00:00Z", "Origin": "Boston", "Destination": "Los Angeles"}
{"Time": "2016-03-28T06:00:00Z", "Origin": "Boston", "Destination": "Los Angeles"}
{"Time": "2016-03-28T08:00:00Z", "Origin": "Boston", "Destination": "New York"}
{"Time": "2016-03-28T08:00:00Z", "Origin": "Boston", "Destination": "New York"}
{"Time": "2016-03-28T12:00:00Z", "Origin": "New York", "Destination": "Los Angeles"}
{"Time": "2016-03-28T12:00:00Z", "Origin": "New York", "Destination": "Los Angeles"}
答案 0 :(得分:1)
{k: v.unstack() for k, v in pd.DataFrame({k: v.stack() for k, v in dd.iteritems()}).T.drop_duplicates().iterrows()}
# iterate through key, value pairs of dictionary,
# stacking each dataframe into a series so that we
# can pass the resulting dataframe into the pd.DataFrame constructor.
df1 = pd.DataFrame({k: v.stack() for k, v in dd.iteritems()})
# Each column is now one key, value pair from the original dictionary
# Transpose and drop duplicates
df2 = df1.T.drop_duplicates()
# reverse the original stacking and convert back to dictionary
# we could have used df2.T.iteritems() but df2.iterrows() took
# one fewer operations and fewer characters to type.
dd_ = {k: v.unstack() for k, v in df2.iterrows()}
for k, v in dd_.iteritems():
print 'key {}:'.format(k)
print v
print '-' * 10
key 0:
a b
0 1 2
1 3 4
----------
key 2:
a b
0 2 3
1 4 5
----------
from StringIO import StringIO
import pandas as pd
text0 = """ Origin Destination Time
0 New York Boston 2016-03-28 02:00:00
1 New York Los Angeles 2016-03-28 04:00:00
2 Boston Los Angeles 2016-03-28 06:00:00"""
text1 = """ Origin Destination Time
0 New York Boston 2016-03-28 02:00:00
1 New York Los Angeles 2016-03-28 04:00:00
2 Boston Los Angeles 2016-03-28 06:00:00"""
text2 = """ Origin Destination Time
0 New York Boston 2016-03-28 02:00:00
1 New York Los Angeles 2016-03-28 04:00:00
2 Los Angeles Boston 2016-03-28 06:00:00"""
dd = {}
dd[0] = pd.read_csv(StringIO(text0), sep='\s{2,}', index_col=0, engine='python')
dd[0].Time = pd.to_datetime(dd[0].Time)
dd[1] = pd.read_csv(StringIO(text1), sep='\s{2,}', index_col=0, engine='python')
dd[1].Time = pd.to_datetime(dd[1].Time)
dd[2] = pd.read_csv(StringIO(text2), sep='\s{2,}', index_col=0, engine='python')
dd[2].Time = pd.to_datetime(dd[2].Time)
# Then run solutions above:
df1 = pd.DataFrame({k: v.stack() for k, v in dd.iteritems()})
df2 = df1.T.drop_duplicates()
dd_ = {k: v.unstack() for k, v in df2.iterrows()}
for k, v in dd_.iteritems():
print 'key {}:'.format(k)
print v
print '-' * 10
你应该得到这个:
key 0:
Origin Destination Time
0 New York Boston 2016-03-28 02:00:00
1 New York Los Angeles 2016-03-28 04:00:00
2 Boston Los Angeles 2016-03-28 06:00:00
----------
key 2:
Origin Destination Time
0 New York Boston 2016-03-28 02:00:00
1 New York Los Angeles 2016-03-28 04:00:00
2 Los Angeles Boston 2016-03-28 06:00:00
----------
import sys
import pandas as pd
print sys.version
print pd.__version__
2.7.11 |Anaconda custom (x86_64)| (default, Dec 6 2015, 18:57:58)
[GCC 4.2.1 (Apple Inc. build 5577)]
0.18.1