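A faster way to check whether elements of a pandas Series exist in a list of lists

Time: 2021-04-21 03:10:54

Tags: python pandas

What is the simplest and fastest way to check whether the elements of a Series exist in a list of lists? For example, I have a Series and a list of lists as shown below. I have a loop that does this, but it is rather slow, so I would like a faster way to do it.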

groups = []
for desc in descs: 
    for i in range(len(list_of_list)):
        if desc in list_of_list[i]:
            groups.append(i)

list_of_list = [['rfnd sms chrgs'],
 ['loan payment receipt'],
 ['zen june2018 aksg sal 1231552',
  'zen july2018 aksg sal 1411191',
  'zen aug2018 aksg mda sal 16014'],
 ['cshw agnes john udo mrs ',
  'cshw agnes john udo',
  'cshw agnes udo',
  'cshw agnes john'],
 ['sms alert charge outstanding'],
 ['maint fee recovery jul 2018', 'vat maint fee recovery jul 2018'],
 ['sept2018 aksg mda sal 20028',
  'oct2018 aksg mda sal 21929',
  'nov2018 aksg mda sal 25229'],
 ['sms alert charges 28th sep 26th oct 2018']]

descs = 

1959                               rfnd sms chrgs
1960                         loan payment receipt
1961                zen june2018 aksg sal 1231552
1962                         loan payment receipt
1963                     cshw agnes john udo mrs 
1964        maint fee frm 31 may 2018 28 jun 2018
1965    vat maint fee frm 31 may 2018 28 jun 2018
1966                 sms alert charge outstanding
1967                         loan payment receipt
1968                zen july2018 aksg sal 1411191
1969                         loan payment receipt

The expected output is something like a list of numbers,

e.g. [1,2,3,4,5,6]

1 answer:

Answer 0 (score: 1)

Prepare the data:

import pandas as pd

# merging a Series without a name is not allowed, so name it first
descs = descs.rename("descs")

# convert list of lists to a series
ll = pd.Series(list_of_list).explode().reset_index()
ll.columns = ["pos", "descs"]
>>> descs
1959                           rfnd sms chrgs
1960                     loan payment receipt
1961            zen june2018 aksg sal 1231552
1962                     loan payment receipt
1963                 cshw agnes john udo mrs
1964    maint fee frm 31 may 2018 28 jun 2018
1965    maint fee frm 31 may 2018 28 jun 2018
1966             sms alert charge outstanding
1967                     loan payment receipt
1968            zen july2018 aksg sal 1411191
1969                     loan payment receipt
Name: descs, dtype: object

>>> ll
    pos                                     descs
0     0                            rfnd sms chrgs
1     1                      loan payment receipt
2     2             zen june2018 aksg sal 1231552
3     2             zen july2018 aksg sal 1411191
4     2            zen aug2018 aksg mda sal 16014
5     3                  cshw agnes john udo mrs
6     3                       cshw agnes john udo
7     3                            cshw agnes udo
8     3                           cshw agnes john
9     4              sms alert charge outstanding
10    5               maint fee recovery jul 2018
11    5           vat maint fee recovery jul 2018
12    6               sept2018 aksg mda sal 20028
13    6                oct2018 aksg mda sal 21929
14    6                nov2018 aksg mda sal 25229
15    7  sms alert charges 28th sep 26th oct 2018
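
To see what the explode step does, here is a minimal sketch on a made-up two-group list (the strings are shortened from the question's data; the variable names are only illustrative). Each sub-list's position becomes the index, and reset_index turns that index into the "pos" column.

```python
import pandas as pd

# Hypothetical, shortened input: group 0 has one entry, group 1 has two
small_list_of_list = [["rfnd sms chrgs"],
                      ["cshw agnes udo", "cshw agnes john"]]

# explode() gives one row per string and repeats the sub-list position in the index
small_ll = pd.Series(small_list_of_list).explode().reset_index()
small_ll.columns = ["pos", "descs"]

print(small_ll)
```

Both strings from the second sub-list end up with pos equal to 1, which is what lets the later merge recover the group number.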

Now you can merge descs and ll to get your list of numbers:

df = pd.merge(descs, ll, on="descs", how="left").set_index(descs.index)
>>> df
                                      descs  pos
1959                         rfnd sms chrgs  0.0
1960                   loan payment receipt  1.0
1961          zen june2018 aksg sal 1231552  2.0
1962                   loan payment receipt  1.0
1963               cshw agnes john udo mrs   3.0
1964  maint fee frm 31 may 2018 28 jun 2018  NaN
1965  maint fee frm 31 may 2018 28 jun 2018  NaN
1966           sms alert charge outstanding  4.0
1967                   loan payment receipt  1.0
1968          zen july2018 aksg sal 1411191  2.0
1969                   loan payment receipt  1.0
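
The "pos" column comes out as float because unmatched rows hold NaN. If a plain list of integers is wanted, as in the question, one possible way (a sketch on a shortened copy of the data, not part of the original answer) is to either switch to pandas' nullable integer dtype or drop the unmatched rows first.

```python
import pandas as pd

# Shortened copy of the question's data, for illustration only
list_of_list = [["rfnd sms chrgs"],
                ["loan payment receipt"],
                ["sms alert charge outstanding"]]
descs = pd.Series(
    ["rfnd sms chrgs",
     "loan payment receipt",
     "maint fee frm 31 may 2018 28 jun 2018",  # not present in any sub-list
     "loan payment receipt"],
    index=[1959, 1960, 1964, 1967],
    name="descs",
)

ll = pd.Series(list_of_list).explode().reset_index()
ll.columns = ["pos", "descs"]
df = pd.merge(descs, ll, on="descs", how="left").set_index(descs.index)

# Option 1: keep every row, with <NA> marking the unmatched ones
groups_nullable = df["pos"].astype("Int64")

# Option 2: keep only the matched rows, as the question's loop does
groups_matched = df["pos"].dropna().astype(int).tolist()

print(groups_nullable)
print(groups_matched)
```

Note that set_index(descs.index) relies on the merge returning exactly one row per element of descs, which assumes no string appears more than once in ll.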

Check:

>>> df.loc[1966, "descs"]
'sms alert charge outstanding'

>>> list_of_list[int(df.loc[1966, "pos"])]
['sms alert charge outstanding']
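
The spot check above looks at a single row. A broader check, sketched here on a shortened copy of the data, is to compare the merge result against the nested loop from the question. The loop skips unmatched descriptions, so the NaN rows are dropped before comparing.

```python
import pandas as pd

# Shortened copy of the question's data, for illustration only
list_of_list = [["rfnd sms chrgs"],
                ["loan payment receipt"],
                ["zen june2018 aksg sal 1231552", "zen july2018 aksg sal 1411191"]]
descs = pd.Series(
    ["rfnd sms chrgs",
     "zen july2018 aksg sal 1411191",
     "maint fee frm 31 may 2018 28 jun 2018",  # not present in any sub-list
     "loan payment receipt"],
    name="descs",
)

# Reference result: the nested loop from the question
loop_groups = []
for desc in descs:
    for i in range(len(list_of_list)):
        if desc in list_of_list[i]:
            loop_groups.append(i)

# Merge-based result
ll = pd.Series(list_of_list).explode().reset_index()
ll.columns = ["pos", "descs"]
df = pd.merge(descs, ll, on="descs", how="left")
merge_groups = df["pos"].dropna().astype(int).tolist()

print(loop_groups == merge_groups)
```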

Another approach

This approach uses the categorical data type. It may be faster.

>>> ll = pd.Series(list_of_list).explode()
>>> descs.astype("category").map(pd.Series(ll.index, index=ll.astype("category")))
1959    0.0
1960    1.0
1961    2.0
1962    1.0
1963    3.0
1964    NaN
1965    NaN
1966    4.0
1967    1.0
1968    2.0
1969    1.0
dtype: float64
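
A third option, not from the original answer, is to build a plain dictionary from each string to its sub-list position and pass it to Series.map. This is only a sketch on a shortened copy of the data; whether it is faster than the two approaches above would need to be timed on the real data.

```python
import pandas as pd

# Shortened copy of the question's data, for illustration only
list_of_list = [["rfnd sms chrgs"],
                ["loan payment receipt"],
                ["sms alert charge outstanding"]]
descs = pd.Series(
    ["rfnd sms chrgs",
     "loan payment receipt",
     "maint fee frm 31 may 2018 28 jun 2018",  # not present in any sub-list
     "loan payment receipt"],
    index=[1959, 1960, 1964, 1967],
)

# Build the lookup once: description -> position of the sub-list containing it.
# setdefault keeps the first position if a string occurs in several sub-lists.
lookup = {}
for pos, sub_list in enumerate(list_of_list):
    for desc in sub_list:
        lookup.setdefault(desc, pos)

# Values missing from the dictionary are mapped to NaN
groups = descs.map(lookup)
print(groups)
```

Unlike the question's loop, this keeps one result per element of descs (NaN when nothing matches) and records only the first matching sub-list.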