Counting instances of strings across multiple columns in Python

Time: 2018-11-21 15:45:55

Tags: python pandas dataframe duplicates

I have the following simple data frame:

import pandas as pd
df = pd.DataFrame({'column_a': ['a', 'b', 'c', 'd', 'e'],
                   'column_b': ['b', 'x', 'y', 'c', 'z']})


      column_a column_b
0        a        b
1        b        x
2        c        y
3        d        c
4        e        z

I would like to get the strings that appear in both columns:

result = ("b", "c")

Thanks

7 answers:

Answer 0: (score: 7)

intersection

This generalizes to any number of columns.

set.intersection(*map(set, map(df.get, df)))

{'b', 'c'}
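
The one-liner above can be unpacked as in the following sketch. Iterating over a DataFrame yields its column labels, `df.get` returns the column for each label, and `set.intersection` keeps only the values present in every column. The third column `column_c` is a hypothetical addition used only to illustrate the generalization; it is not part of the question.

```python
import pandas as pd

df = pd.DataFrame({'column_a': ['a', 'b', 'c', 'd', 'e'],
                   'column_b': ['b', 'x', 'y', 'c', 'z']})

# Iterating over a DataFrame yields column labels; df.get(label) returns that column.
column_sets = [set(df.get(label)) for label in df]

# Keep only the values present in every column.
common = set.intersection(*column_sets)
print(common)  # {'b', 'c'} (set order may vary)

# Hypothetical third column, added only to show that the same call scales.
df3 = df.assign(column_c=['c', 'q', 'r', 's', 't'])
common3 = set.intersection(*map(set, map(df3.get, df3)))
print(common3)  # {'c'}
```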

Answer 1: (score: 5)

Use Python's set objects:

in_a = set(df.column_a)
in_b = set(df.column_b)
in_both = in_a.intersection(in_b)
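
The question asks for `result = ("b", "c")`, while a set has no defined order. A minimal sketch of one way to get a tuple from the set, assuming alphabetical order is acceptable (the question does not say which order is wanted):

```python
import pandas as pd

df = pd.DataFrame({'column_a': ['a', 'b', 'c', 'd', 'e'],
                   'column_b': ['b', 'x', 'y', 'c', 'z']})

# set.intersection accepts any iterable, so the second column need not be converted first.
in_both = set(df.column_a).intersection(df.column_b)

# Sets are unordered; sorting gives a reproducible tuple.
result = tuple(sorted(in_both))
print(result)  # ('b', 'c')
```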

Answer 2: (score: 4)

Similar to Sandeep Kadapa's solution, but without tolist and loc.

>>> tuple(df['column_a'][df['column_a'].isin(df['column_b'])])
('b', 'c')
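
Unlike the set-based answers, `isin` filters rows, so it keeps the row order of `column_a` and also keeps repeated values. The sketch below uses a hypothetical frame `dup` (not from the question) in which 'b' occurs twice, and shows `unique()` as one way to drop the repeats.

```python
import pandas as pd

# Hypothetical frame in which 'b' occurs twice in column_a.
dup = pd.DataFrame({'column_a': ['b', 'a', 'b', 'c'],
                    'column_b': ['c', 'b', 'x', 'y']})

# Boolean mask: True where the column_a value occurs anywhere in column_b.
mask = dup['column_a'].isin(dup['column_b'])
matches = dup.loc[mask, 'column_a']

print(tuple(matches))           # ('b', 'b', 'c')
print(tuple(matches.unique()))  # ('b', 'c')
```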

Answer 3: (score: 2)

Data

n = 10_000  # must be an int: a list cannot be multiplied by the float 10e3

ints = pd.DataFrame({'column_a': [1, 2, 3, 4, 5] * n,
                   'column_b': [2, 10, 9, 3, 8] * n})

strings = pd.DataFrame({'column_a': ['a', 'b', 'c', 'd', 'e'] * n,
                   'column_b': ['b', 'x', 'y', 'c', 'z'] * n})

Methods

import numpy as np
from functools import reduce

def using_isin(df):  # @timgeb
    return df['column_a'][df['column_a'].isin(df['column_b'])]

def using_isin_loc_tolist(df):  # @SandeepKadapa
    return df.loc[df['column_a'].isin(df['column_b'].tolist()),'column_a']

def using_melt_groupby(df):  # @W-B
    return df.melt().groupby('value').variable.nunique().loc[lambda x : x>1].index

def using_set_intersection(df):  # @GergesDib, @TBurgins
    return set(df['column_a']).intersection(set(df['column_b']))

def using_set_intersection_map(df):  # @piRSquared
    return set.intersection(*map(set, map(df.get, df)))

def using_reduce_np_intersect(df):  # @JonClements
    return reduce(np.intersect1d, df.values.T)

def using_np_any(df):  # @W-B
    return df.column_a[np.any(df['column_a'].values == df['column_b'].values[:, None], 0)]

Performance with columns containing integers

%timeit -n 10 using_isin(ints)
977 µs ± 164 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit -n 10 using_reduce_np_intersect(ints)
1.31 ms ± 145 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit -n 10 using_set_intersection(ints)
1.54 ms ± 68.6 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit -n 10 using_set_intersection_map(ints)
1.59 ms ± 59.2 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit -n 10 using_isin_loc_tolist(ints)
2.39 ms ± 921 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit -n 10 using_melt_groupby(ints)
34.2 ms ± 988 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit -n 10 using_np_any(ints)
4.35 s ± 148 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Performance with columns containing strings

%timeit -n 10 using_set_intersection_map(strings)
1.16 ms ± 35.7 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit -n 10 using_set_intersection(strings)
1.2 ms ± 71.9 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit -n 10 using_isin(strings)
1.69 ms ± 310 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit -n 10 using_isin_loc_tolist(strings)
2.15 ms ± 299 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit -n 10 using_melt_groupby(strings)
35.6 ms ± 1.07 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit -n 10 using_reduce_np_intersect(strings)
43 ms ± 1.98 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit -n 10 using_np_any(strings)
# too slow to count
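
`%timeit` is an IPython magic. Outside IPython, a rough equivalent can be sketched with the standard-library `timeit` module, as below. Only two of the methods are repeated here to keep the example short, and it reports the best of 7 runs rather than mean and standard deviation, so the figures are not directly comparable with those above.

```python
import timeit

import pandas as pd

n = 10_000
strings = pd.DataFrame({'column_a': ['a', 'b', 'c', 'd', 'e'] * n,
                        'column_b': ['b', 'x', 'y', 'c', 'z'] * n})

def using_isin(df):
    return df['column_a'][df['column_a'].isin(df['column_b'])]

def using_set_intersection(df):
    return set(df['column_a']).intersection(set(df['column_b']))

for func in (using_isin, using_set_intersection):
    # 7 repeats of 10 calls each, mirroring "%timeit -n 10".
    runs = timeit.repeat(lambda: func(strings), repeat=7, number=10)
    per_loop_ms = min(runs) / 10 * 1e3
    print(f"{func.__name__}: {per_loop_ms:.2f} ms per loop (best of 7)")
```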

Answer 4: (score: 1)

Use isin and tuple as:

tuple(df.loc[df['column_a'].isin(df['column_b'].tolist()),'column_a'])
('b', 'c')

Answer 5: (score: 1)

This is essentially the same concept (using sets) as the answers already posted, but I find it simpler:

set(df.column_a) & set(df.column_b)

Answer 6: (score: 1)

Using melt

df.melt().groupby('value').variable.nunique().loc[lambda x : x>1].index
Out[79]: Index(['b', 'c'], dtype='object', name='value')
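
The chained call above can be read step by step, as in this sketch. The intermediate names `long_form` and `counts` are introduced here for illustration only.

```python
import pandas as pd

df = pd.DataFrame({'column_a': ['a', 'b', 'c', 'd', 'e'],
                   'column_b': ['b', 'x', 'y', 'c', 'z']})

# melt() stacks both columns into long form:
# 'variable' holds the source column name, 'value' holds the string.
long_form = df.melt()

# For each string, count how many distinct source columns it came from.
counts = long_form.groupby('value')['variable'].nunique()

# Strings seen in more than one column.
in_both = counts[counts > 1].index
print(list(in_both))  # ['b', 'c']
```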

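If speed matters (small frames only: the comparison below builds an n × n boolean array)

import numpy as np

s1 = df['column_a'].values
s2 = df['column_b'].values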

df.column_a[np.any(s1 == s2[:, None], 0)]
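
A sketch of what the broadcast comparison computes. It uses `to_numpy()` instead of `.values` so that plain NumPy arrays are obtained; that substitution is a choice made for this example, not part of the answer. The intermediate array has one entry per pair of rows, so memory and time grow quadratically with the number of rows.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'column_a': ['a', 'b', 'c', 'd', 'e'],
                   'column_b': ['b', 'x', 'y', 'c', 'z']})

s1 = df['column_a'].to_numpy()
s2 = df['column_b'].to_numpy()

# s2[:, None] has shape (5, 1); comparing it with s1 of shape (5,)
# broadcasts to a (5, 5) array where pairwise[i, j] is s1[j] == s2[i].
pairwise = s1 == s2[:, None]
print(pairwise.shape)  # (5, 5)

# Reduce over axis 0: True where an element of column_a matches any element of column_b.
mask = np.any(pairwise, axis=0)
print(df.column_a[mask].tolist())  # ['b', 'c']
```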