Elegant way to remove elements from list entries in a DataFrame if they are not contained in another list

Time: 2019-01-28 15:36:41

Tags: python pandas

Let's say I have the following list:

list = ['a', 'b', 'c', 'd']

And a DataFrame like this:

df = pd.DataFrame({'content': [['a', 'b', 'abc'], ['c', 'd', 'xyz'], ['d', 'xyz']]})
Out:
       content
0  [a, b, abc]
1  [c, d, xyz]
2     [d, xyz]

I need a function that removes every element in the 'content' column that is not in 'list', so my output should look like this:

Out:  
  content
0  [a, b]
1  [c, d]
2     [d]

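Please bear in mind that my actual df has about 1m rows and the list has about 1k items. I tried iterating over the rows, but it took ages...

5 Answers:

Answer 0: (score: 3)

IIUC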

df['new'] = [[y for y in x if y in l] for x in df.content]

Answer 1: (score: 2)

One way is to use apply:

keep = ['a', 'b', 'c', 'd'] # don't use list as a variable name
df = pd.DataFrame({'content': [['a', 'b', 'abc'], ['c', 'd', 'xyz'], ['d', 'xyz']]})

df['fixed_content'] = df.apply(lambda row: [x for x in row['content'] if x in keep], axis=1)

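Answer 2: (score: 2)

Assuming the lists in the Series contain unique values, you can use dict.keys to compute the intersection. dict.fromkeys itself preserves insertion order in Python 3.7+, but the & operation returns a plain set, so the order of the result is not guaranteed (see row 1 in the output):

L = ['a', 'b', 'c', 'd']  # the list of values to keep
df['content'] = [list(dict.fromkeys(x).keys() & L) for x in df['content']]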

print(df)

  content
0  [a, b]
1  [d, c]
2     [d]
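
Because & on a keys view returns a plain set, the row order can change, as in row 1 above ([d, c]). If the original order matters, one possible variant is sketched below. It is not part of the original answer: it keeps dict.fromkeys for de-duplication but filters with a membership test against a set, and the name keep is only illustrative.

```python
import pandas as pd

keep = {'a', 'b', 'c', 'd'}  # illustrative name; a set gives fast membership tests
df = pd.DataFrame({'content': [['a', 'b', 'abc'], ['c', 'd', 'xyz'], ['d', 'xyz']]})

# dict.fromkeys(x) drops duplicates while keeping first-seen order (Python 3.7+);
# the comprehension then keeps only the values present in `keep`, in that order.
df['content'] = [[y for y in dict.fromkeys(x) if y in keep] for x in df['content']]

print(df)
```

With this variant, row 1 comes out as [c, d] rather than [d, c].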

Answer 3: (score: 0)

Another option, using filter:

>>> list1 = ['a', 'b', 'c', 'd']
>>> df = pd.DataFrame({'content': [['a', 'b', 'abc'], ['c', 'd', 'xyz'], ['d', 'xyz']]})
>>> df['content']=[list(filter(lambda x:x in list1,i)) for i in df['content']]
>>> df
  content
0  [a, b]
1  [c, d]
2     [d]

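Answer 4: (score: 0)

Given that the list of strings we are checking membership against has about 1k items, the answers already posted can be made much more efficient by first converting the list to a set.

In my tests, the fastest approach was to convert the list to a set and then use the answer posted by W-B: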

l = set(l)
df['new'] = [[y for y in x if y in l] for x in df.content]
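
To illustrate why the conversion helps: `in` on a list scans the elements one by one, while `in` on a set is a hash lookup. The sketch below times the same row filter against both containers. The data is made up for illustration (keep_list, keep_set, row and filter_row are hypothetical names, not from the question), and absolute timings will vary by machine.

```python
import timeit

# Hypothetical stand-in for the 1k-item list from the question
keep_list = [str(i) for i in range(1000)]
keep_set = set(keep_list)

# Hypothetical row: a mix of values that are and are not in the keep list
row = ['5', '999', 'missing', '42', 'nope']


def filter_row(row, keep):
    """Keep only the elements of `row` that are present in `keep`."""
    return [y for y in row if y in keep]


# Both containers give the same filtered row; only the lookup cost differs
assert filter_row(row, keep_list) == filter_row(row, keep_set)

t_list = timeit.timeit(lambda: filter_row(row, keep_list), number=2000)
t_set = timeit.timeit(lambda: filter_row(row, keep_set), number=2000)
print(f"list lookup: {t_list:.4f}s   set lookup: {t_set:.4f}s")
```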

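The full test code and results are below. I had to make some assumptions about the exact nature of the real dataset, but I think randomly generated lists of strings should be reasonably representative. Note that I left out T Burgis's solution because I ran into an error. I may have done something wrong, but since they had already commented that W-B's solution was faster, I didn't try very hard to figure it out. I should also note that, for consistency, every solution assigns its result to df['new'], whether or not the original answer did so.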

import random
import string
import pandas as pd


def initial_setup():
    """
    Returns a 1m row x 1 column DataFrame, and a 992 element list of strings (all unique).
    """
    random.seed(1)
    keep = list(set([''.join(random.choices(string.ascii_lowercase, k=random.randint(1, 5))) for i in range(1250)]))
    content = [[''.join(random.choices(string.ascii_lowercase, k=random.randint(1, 5))) for i in range(5)] for j in range(1000000)]
    df = pd.DataFrame({'content': content})
    return df, keep


def jpp(df, L):
    df['new'] = [list(dict.fromkeys(x).keys() & L) for x in df['content']]


def wb(df, l):
    df['new'] = [[y for y in x if y in l] for x in df.content]


def jonathon(df, list1):
    df['new'] = [list(filter(lambda x:x in list1,i)) for i in df['content']]

Tests without converting to a set:

In [3]: df, keep = initial_setup()
   ...: %timeit jpp(df, keep)
   ...: 
16.9 s ± 333 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [4]: df, keep = initial_setup()
   ...: %timeit wb(df, keep)
1min ± 612 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [5]: df, keep = initial_setup()
   ...: %timeit jonathon(df, keep)
1min 2s ± 1.26 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

Tests with the list converted to a set:

In [6]: df, keep = initial_setup()
   ...: %timeit jpp(df, set(keep))
   ...: 
1.7 s ± 18.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [7]: df, keep = initial_setup()
   ...: %timeit wb(df, set(keep))
   ...: 
689 ms ± 20.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [8]: df, keep = initial_setup()
   ...: %timeit jonathon(df, set(keep))
   ...: 
1.26 s ± 10.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
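
The timings above compare speed only. As a quick sanity check, the sketch below runs the three functions on the small example from the question and confirms they select the same elements. Each row is sorted before comparison, since the set-based jpp output has no guaranteed order. The results dictionary and the loop are illustrative additions, not part of the original benchmark.

```python
import pandas as pd


def jpp(df, L):
    df['new'] = [list(dict.fromkeys(x).keys() & L) for x in df['content']]


def wb(df, l):
    df['new'] = [[y for y in x if y in l] for x in df.content]


def jonathon(df, list1):
    df['new'] = [list(filter(lambda x: x in list1, i)) for i in df['content']]


keep = set(['a', 'b', 'c', 'd'])
results = {}
for func in (jpp, wb, jonathon):
    # Fresh copy of the example data for each function
    df = pd.DataFrame({'content': [['a', 'b', 'abc'], ['c', 'd', 'xyz'], ['d', 'xyz']]})
    func(df, keep)
    # Sort each row so the unordered jpp output is comparable with the others
    results[func.__name__] = [sorted(row) for row in df['new']]

assert results['jpp'] == results['wb'] == results['jonathon']
print(results['wb'])
```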