Merging DataFrames with an ordering criterion

Time: 2017-04-13 18:38:47

Tags: python pandas

In a previous question, I asked how to match the values in this DataFrame, source:

     car_id     lat     lon
0    100        10.0    15.0
1    100        12.0    10.0
2    100        13.0    09.0
3    110        23.0    08.0
4    110        13.0    09.0
5    110        12.0    10.0
6    110        12.0    02.0
7    120        11.0    11.0
8    120        12.0    10.0
9    120        13.0    09.0
10   120        14.0    08.0
11   130        12.0    10.0

and keep only the rows whose coordinates appear in a second DataFrame, coords:

     lat     lon
0    12.0    10.0
1    13.0    09.0

But this time I want to match every car_id that has:

  • all the values from coords
  • in the same order

So the resulting DataFrame, result, would be:

     car_id
1    100
2    120

# 110 has all the values from coords, but not in the same order
# 130 doesn't have all the values from coords
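
For reference, the two criteria can be written as a plain loop-based baseline. This is only a sketch under one assumed reading of "same order": for each car, the rows that appear in coords, taken in order of appearance, must be exactly the rows of coords. The function name matching_cars_loop is hypothetical.

```python
import pandas as pd

source = pd.DataFrame({
    'car_id': [100, 100, 100, 110, 110, 110, 110, 120, 120, 120, 120, 130],
    'lat': [10.0, 12.0, 13.0, 23.0, 13.0, 12.0, 12.0, 11.0, 12.0, 13.0, 14.0, 12.0],
    'lon': [15.0, 10.0, 9.0, 8.0, 9.0, 10.0, 2.0, 11.0, 10.0, 9.0, 8.0, 10.0],
})
coords = pd.DataFrame({'lat': [12.0, 13.0], 'lon': [10.0, 9.0]})


def matching_cars_loop(source, coords):
    """Loop-based baseline (hypothetical helper, not vectorized)."""
    wanted = list(zip(coords['lat'], coords['lon']))
    wanted_set = set(wanted)
    keep = []
    for car_id, group in source.groupby('car_id', sort=False):
        # rows of this car that also occur in coords, in order of appearance
        seen = [pair for pair in zip(group['lat'], group['lon'])
                if pair in wanted_set]
        # assumed criterion: same rows as coords, in the same order
        if seen == wanted:
            keep.append(int(car_id))
    return keep


print(matching_cars_loop(source, coords))  # [100, 120]
```

Car 110 is rejected because its matching rows come in the opposite order, and car 130 because it has only one of the two rows.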

Is there a way to achieve this in a vectorized fashion, avoiding lots of loops and conditionals?

2 Answers:

Answer 0: (score: 1)

It's not pretty, but what if you did something like this:

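import pandas as pd

# df is the `source` DataFrame from the question
df2 = pd.DataFrame(df, copy=True)
# pair each row with the coordinates of the row that follows it
df2[['lat2', 'lon2']] = df[['lat', 'lon']].shift(-1)
df2.set_index(['lat', 'lon', 'lat2', 'lon2'], inplace=True)
print(df2.loc[(12, 10, 13, 9)].reset_index(drop=True))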

   car_id
0     100
1     120

And here is the general case:

import pandas as pd

raw_data = {'car_id': [100, 100, 100, 110, 110, 110, 110, 120, 120, 120, 120, 130],
            'lat': [10, 12, 13, 23, 13, 12, 12, 11, 12, 13, 14, 12],
            'lon': [15, 10, 9, 8, 9, 10, 2, 11, 10, 9, 8, 10],
           }
df = pd.DataFrame(raw_data, columns = ['car_id', 'lat', 'lon'])

raw_data = {
             'lat': [10, 12, 13],
             'lon': [15, 10, 9],
           }

coords = pd.DataFrame(raw_data, columns = ['lat', 'lon'])

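def submatch(df, match):
    # start from car_id; one (lat, lon) column pair is added per row of match
    df2 = pd.DataFrame(df['car_id'])
    n = match.shape[0]
    for x in range(n):
        # latx/lonx hold the coordinates found x rows further down
        df2[['lat{}'.format(x), 'lon{}'.format(x)]] = df[['lat', 'lon']].shift(-x)

    # flat list of column names: lat0, lon0, lat1, lon1, ...
    cols = [item for sublist in
        [['lat{}'.format(x), 'lon{}'.format(x)] for x in range(n)]
        for item in sublist]

    df2.set_index(cols, inplace=True)
    # look up the whole pattern at once as a single MultiIndex key
    return df2.loc[tuple(match.stack().values)].reset_index(drop=True)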

print(submatch(df, coords))

   car_id
0     100

Answer 1: (score: 1)

Plan

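  • We will groupby 'car_id' and evaluate each subset
  • After an inner merge, we should see two things
    1. The merged dataframe should have the same values as coords
    2. The merged dataframe should cover everything in coords
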
def duper(df):
    m = df.merge(coords)
    c = pd.concat([m, coords])
    # we put the merged rows first and those are
    # the ones we'll keep after `drop_duplicates(keep='first')`
    # `keep='first'` is the default, so I don't pass it
    c1 = (c.drop_duplicates().values == coords.values).all()

    # if `keep=False` then I drop all duplicates.  If I got
    # everything in `coords` this should be empty
    c2 = c.drop_duplicates(keep=False).empty
    return c1 & c2

source.set_index('car_id').groupby(level=0).filter(duper).index.unique().values

array([100, 120])
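
To see how the two checks behave, the sketch below evaluates them separately for each car on the question's data. The helper name checks and the labels same_order and covers_all are assumptions introduced for illustration; they correspond to c1 and c2 in duper.

```python
import pandas as pd

source = pd.DataFrame({
    'car_id': [100, 100, 100, 110, 110, 110, 110, 120, 120, 120, 120, 130],
    'lat': [10.0, 12.0, 13.0, 23.0, 13.0, 12.0, 12.0, 11.0, 12.0, 13.0, 14.0, 12.0],
    'lon': [15.0, 10.0, 9.0, 8.0, 9.0, 10.0, 2.0, 11.0, 10.0, 9.0, 8.0, 10.0],
})
coords = pd.DataFrame({'lat': [12.0, 13.0], 'lon': [10.0, 9.0]})


def checks(group):
    """Return the two intermediate tests (c1, c2) for one car's rows."""
    m = group[['lat', 'lon']].merge(coords)   # inner merge on lat and lon
    c = pd.concat([m, coords])
    # c1: merged rows first, so de-duplicating keeps them in the car's order
    same_order = bool((c.drop_duplicates().values == coords.values).all())
    # c2: anything left after dropping every duplicated row is missing from the car
    covers_all = bool(c.drop_duplicates(keep=False).empty)
    return same_order, covers_all


report = {int(car_id): checks(group)
          for car_id, group in source.groupby('car_id')}
for car_id, (same_order, covers_all) in report.items():
    print(car_id, same_order, covers_all)
# 100 True True
# 110 False True
# 120 True True
# 130 True False
```

Car 110 fails only the order check and car 130 fails only the coverage check, which is why duper needs both. The positional comparison in c1 assumes coords itself has no duplicate rows.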

Slight alternative

def duper(df):
    # axis must be given by keyword in recent pandas; columns= is the clearest form
    m = df.drop(columns='car_id').merge(coords)
    c = pd.concat([m, coords])
    c1 = (c.drop_duplicates().values == coords.values).all()
    c2 = c.drop_duplicates(keep=False).empty
    return c1 & c2

source.groupby('car_id').filter(duper).car_id.unique()