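Remove duplicate pandas DataFrames from a dictionary

Date: 2016-07-10 09:31:11

Tags: python pandas dictionary duplicates

I have a dictionary whose values are pandas DataFrames with identical column names, and I want to remove the duplicate DataFrames, i.e. those that have the same values and the same row IDs.

Suppose this is my dictionary of DataFrames: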

>>> dd[0]
              Origin           Destination                Time
0           New York                Boston 2016-03-28 02:00:00
1           New York           Los Angeles 2016-03-28 04:00:00
2             Boston           Los Angeles 2016-03-28 06:00:00
>>> dd[1]
              Origin           Destination                Time
0           New York                Boston 2016-03-28 02:00:00
1           New York           Los Angeles 2016-03-28 04:00:00
2             Boston           Los Angeles 2016-03-28 06:00:00
>>> dd[2]
              Origin           Destination                Time
0           New York                Boston 2016-03-28 02:00:00
1           New York           Los Angeles 2016-03-28 04:00:00
2             Boston           Los Angeles 2016-03-28 06:00:00
>>> dd[3]
              Origin           Destination                Time
1           New York           Los Angeles 2016-03-28 04:00:00
2           Los Angeles             Boston 2016-03-28 06:00:00
3             Boston              New York 2016-03-28 08:00:00
>>> dd[4]
              Origin           Destination                Time
1           New York           Los Angeles 2016-03-28 04:00:00
2           Los Angeles             Boston 2016-03-28 06:00:00
3             Boston              New York 2016-03-28 08:00:00
>>> dd[5]
              Origin           Destination                Time
3             Boston              New York 2016-03-28 08:00:00
4           New York           Los Angeles 2016-03-28 12:00:00
>>> dd[6]
              Origin           Destination                Time
3             Boston              New York 2016-03-28 08:00:00
4           New York           Los Angeles 2016-03-28 12:00:00

I would like the result to look like this:

>>> dd[0]
              Origin           Destination                Time
0           New York                Boston 2016-03-28 02:00:00
1           New York           Los Angeles 2016-03-28 04:00:00
2             Boston           Los Angeles 2016-03-28 06:00:00
>>> dd[3]
              Origin           Destination                Time
1           New York           Los Angeles 2016-03-28 04:00:00
2           Los Angeles             Boston 2016-03-28 06:00:00
3             Boston              New York 2016-03-28 08:00:00
>>> dd[5]
              Origin           Destination                Time
3             Boston              New York 2016-03-28 08:00:00
4           New York           Los Angeles 2016-03-28 12:00:00

This is my code, which produces the example above:

import pandas as pd

# Load the data as a pandas DataFrame
data = pd.read_csv("website.txt", names=["Time", "Origin", "Destination"])
data["Time"] = pd.to_datetime(data["Time"], infer_datetime_format=True)
# Reverse the DataFrame by index so the loop runs backwards
data = data.reindex(index=data.index[::-1])
dd = {}
# For each row, keep every row whose Time lies strictly within 4 hours of it
for i, e in data.iterrows():
    dd[i] = data[(data['Time'] > e['Time'] - pd.Timedelta('4 hours')) & (data['Time'] < e['Time'] + pd.Timedelta('4 hours'))]

Raw data:

{"Time": "2016-03-28T02:00:00Z", "Origin": "New York", "Destination": "Boston"}
{"Time": "2016-03-28T02:00:00Z", "Origin": "New York", "Destination": "Boston"}
{"Time": "2016-03-28T02:00:00Z", "Origin": "New York", "Destination": "Boston"}
{"Time": "2016-03-28T04:00:00Z", "Origin": "New York", "Destination": "Los Angeles"}
{"Time": "2016-03-28T04:00:00Z", "Origin": "New York", "Destination": "Los Angeles"}
{"Time": "2016-03-28T04:00:00Z", "Origin": "New York", "Destination": "Los Angeles"}
{"Time": "2016-03-28T06:00:00Z", "Origin": "Boston", "Destination": "Los Angeles"}
{"Time": "2016-03-28T06:00:00Z", "Origin": "Boston", "Destination": "Los Angeles"}
{"Time": "2016-03-28T06:00:00Z", "Origin": "Boston", "Destination": "Los Angeles"}
{"Time": "2016-03-28T08:00:00Z", "Origin": "Boston", "Destination": "New York"}
{"Time": "2016-03-28T08:00:00Z", "Origin": "Boston", "Destination": "New York"}
{"Time": "2016-03-28T12:00:00Z", "Origin": "New York", "Destination": "Los Angeles"}
{"Time": "2016-03-28T12:00:00Z", "Origin": "New York", "Destination": "Los Angeles"}

1 answer:

Answer 0 (score: 1)

One-liner

{k: v.unstack() for k, v in pd.DataFrame({k: v.stack() for k, v in dd.iteritems()}).T.drop_duplicates().iterrows()}

Explained version

# iterate through key, value pairs of dictionary,
# stacking each dataframe into a series so that we
# can pass the resulting dataframe into the pd.DataFrame constructor.
df1 = pd.DataFrame({k: v.stack() for k, v in dd.iteritems()})
# Each column is now one key, value pair from the original dictionary
# Transpose and drop duplicates
df2 = df1.T.drop_duplicates()
# reverse the original stacking and convert back to a dictionary;
# we could have used df2.T.iteritems(), but df2.iterrows() takes
# one fewer operation and fewer characters to type.
dd_ = {k: v.unstack() for k, v in df2.iterrows()}

for k, v in dd_.iteritems():
    print 'key {}:'.format(k)
    print v
    print '-' * 10

Output (from a small toy dictionary, not the question's data):

key 0:
   a  b
0  1  2
1  3  4
----------
key 2:
   a  b
0  2  3
1  4  5
----------
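
For readers on Python 3, the same stack, transpose and drop-duplicates steps might be sketched as follows. The toy frames are hypothetical, chosen only to match the `a`/`b` output shown above; `.items()` and `print()` replace the Python 2 `.iteritems()` and `print` statement, and the names `stacked`, `unique_rows` and `deduped` are illustrative.

```python
import pandas as pd

# Hypothetical toy data: keys 0 and 1 hold identical frames, key 2 differs.
dd = {
    0: pd.DataFrame({"a": [1, 3], "b": [2, 4]}),
    1: pd.DataFrame({"a": [1, 3], "b": [2, 4]}),
    2: pd.DataFrame({"a": [2, 4], "b": [3, 5]}),
}

# Stack each frame into a Series indexed by (row label, column name);
# every dictionary key becomes one column of the combined frame.
stacked = pd.DataFrame({k: v.stack() for k, v in dd.items()})

# Transpose so each original frame is one row, then drop repeated rows.
unique_rows = stacked.T.drop_duplicates()

# Unstack each surviving row back into a two-dimensional frame.
deduped = {k: row.unstack() for k, row in unique_rows.iterrows()}

for k, v in deduped.items():
    print("key {}:".format(k))
    print(v)
    print("-" * 10)
```

Because `drop_duplicates` keeps the first occurrence by default, key 1 is discarded and keys 0 and 2 survive.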

Setup to get the same results as me (copy and paste this)

from StringIO import StringIO
import pandas as pd

text0 = """              Origin           Destination                 Time
0           New York                Boston  2016-03-28 02:00:00
1           New York           Los Angeles  2016-03-28 04:00:00
2             Boston           Los Angeles  2016-03-28 06:00:00"""


text1 = """              Origin           Destination                 Time
0           New York                Boston  2016-03-28 02:00:00
1           New York           Los Angeles  2016-03-28 04:00:00
2             Boston           Los Angeles  2016-03-28 06:00:00"""

text2 = """              Origin           Destination                 Time
0           New York                Boston  2016-03-28 02:00:00
1           New York           Los Angeles  2016-03-28 04:00:00
2           Los Angeles             Boston  2016-03-28 06:00:00"""

dd = {}

# Columns are separated by two or more spaces; a regex separator needs the python engine
dd[0] = pd.read_csv(StringIO(text0), sep=r'\s{2,}', index_col=0, engine='python')
dd[0].Time = pd.to_datetime(dd[0].Time)

dd[1] = pd.read_csv(StringIO(text1), sep=r'\s{2,}', index_col=0, engine='python')
dd[1].Time = pd.to_datetime(dd[1].Time)

dd[2] = pd.read_csv(StringIO(text2), sep=r'\s{2,}', index_col=0, engine='python')
dd[2].Time = pd.to_datetime(dd[2].Time)

# Then run solutions above:

df1 = pd.DataFrame({k: v.stack() for k, v in dd.iteritems()})
df2 = df1.T.drop_duplicates()
dd_ = {k: v.unstack() for k, v in df2.iterrows()}

for k, v in dd_.iteritems():
    print 'key {}:'.format(k)
    print v
    print '-' * 10

You should get this:

key 0:
     Origin  Destination                 Time
0  New York       Boston  2016-03-28 02:00:00
1  New York  Los Angeles  2016-03-28 04:00:00
2    Boston  Los Angeles  2016-03-28 06:00:00
----------
key 2:
        Origin  Destination                 Time
0     New York       Boston  2016-03-28 02:00:00
1     New York  Los Angeles  2016-03-28 04:00:00
2  Los Angeles       Boston  2016-03-28 06:00:00
----------
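
As an alternative to stacking, and not the method used in this answer, duplicates can also be found by comparing frames pairwise with `DataFrame.equals`, which is True only when index labels, column labels, dtypes and values all match. The sketch below assumes Python 3; `drop_duplicate_frames` and `make_frame` are hypothetical helpers, and the frames are rebuilt by hand from the question's example.

```python
import pandas as pd


def drop_duplicate_frames(frames):
    """Keep the first key of every group of frames that compare equal."""
    unique = {}
    for key, frame in frames.items():
        # DataFrame.equals requires matching shape, labels, dtypes and values.
        if not any(frame.equals(kept) for kept in unique.values()):
            unique[key] = frame
    return unique


def make_frame(index, origins, destinations, times):
    """Hypothetical helper that builds one frame in the question's layout."""
    return pd.DataFrame(
        {"Origin": origins, "Destination": destinations,
         "Time": pd.to_datetime(times)},
        index=index,
    )


first = make_frame([0, 1, 2],
                   ["New York", "New York", "Boston"],
                   ["Boston", "Los Angeles", "Los Angeles"],
                   ["2016-03-28 02:00", "2016-03-28 04:00", "2016-03-28 06:00"])
second = make_frame([1, 2, 3],
                    ["New York", "Los Angeles", "Boston"],
                    ["Los Angeles", "Boston", "New York"],
                    ["2016-03-28 04:00", "2016-03-28 06:00", "2016-03-28 08:00"])
third = make_frame([3, 4],
                   ["Boston", "New York"],
                   ["New York", "Los Angeles"],
                   ["2016-03-28 08:00", "2016-03-28 12:00"])

dd = {0: first, 1: first.copy(), 2: first.copy(),
      3: second, 4: second.copy(), 5: third, 6: third.copy()}

deduped = drop_duplicate_frames(dd)
print(sorted(deduped))  # [0, 3, 5]
```

This keeps each frame untouched, so frames with different row labels or shapes need no alignment, at the cost of a number of comparisons that grows quadratically with the number of distinct frames.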

Versions

import sys
import pandas as pd
print sys.version
print pd.__version__

2.7.11 |Anaconda custom (x86_64)| (default, Dec  6 2015, 18:57:58) 
[GCC 4.2.1 (Apple Inc. build 5577)]
0.18.1