Say I have the following list:
list = ['a', 'b', 'c', 'd']
and a DataFrame like this:
df = pd.DataFrame({'content': [['a', 'b', 'abc'], ['c', 'd', 'xyz'], ['d', 'xyz']]})
Out:
content
0 [a, b, abc]
1 [c, d, xyz]
2 [d, xyz]
I need a function that removes every element of the 'content' column that is not in 'list', so my output should look like this:
Out:
content
0 [a, b]
1 [c, d]
2 [d]
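Please bear in mind that my actual df has about 1m rows and the list has about 1k items. I tried iterating over the rows, but it took a very long time...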
Answer 0 (score: 3)
IIUC
Answer 1 (score: 2)
One way is to use apply:
keep = ['a', 'b', 'c', 'd'] # don't use list as a variable name
df = pd.DataFrame({'content': [['a', 'b', 'abc'], ['c', 'd', 'xyz'], ['d', 'xyz']]})
df['fixed_content'] = df.apply(lambda row: [x for x in row['content'] if x in keep],axis=1)
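Answer 2 (score: 2)
Assuming the lists in the series contain unique values, you can use dict.keys to compute the intersection with the list L of values to keep. Note that & on a keys view returns a set, so the original element order is not guaranteed (row 1 below comes back as [d, c]):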
df['content'] = [list(dict.fromkeys(x).keys() & L) for x in df['content']]
print(df)
content
0 [a, b]
1 [d, c]
2 [d]
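If the original order matters, one possible variant is sketched below. It keeps the dict.fromkeys de-duplication but filters with a comprehension instead of &, so elements stay in first-seen order. The plain list `rows` is only a stand-in for df['content'], and the helper name `filter_in_order` is made up for this illustration.

```python
rows = [['a', 'b', 'abc'], ['c', 'd', 'xyz'], ['d', 'xyz']]  # stand-in for df['content']
keep = {'a', 'b', 'c', 'd'}  # a set, so each membership test is a hash lookup


def filter_in_order(items, allowed):
    """Keep items found in `allowed`, in first-seen order, without duplicates."""
    # dict.fromkeys de-duplicates while preserving insertion order (Python 3.7+).
    return [x for x in dict.fromkeys(items) if x in allowed]


filtered = [filter_in_order(row, keep) for row in rows]
print(filtered)  # [['a', 'b'], ['c', 'd'], ['d']]
```

With a DataFrame, the same comprehension could be assigned back, e.g. df['content'] = [filter_in_order(x, keep) for x in df['content']].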
Answer 3 (score: 0)
Another option, using filter:
>>> list1 = ['a', 'b', 'c', 'd']
>>> df = pd.DataFrame({'content': [['a', 'b', 'abc'], ['c', 'd', 'xyz'], ['d', 'xyz']]})
>>> df['content']=[list(filter(lambda x:x in list1,i)) for i in df['content']]
>>> df
content
0 [a, b]
1 [c, d]
2 [d]
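Answer 4 (score: 0)
Given that the list of strings we are checking membership against has about 1k items, every answer posted so far can be made much more efficient by first converting the list to a set.
In my tests, the fastest approach was to convert the list to a set and then use the answer posted by W-B: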
l = set(l)
df['new'] = [[y for y in x if y in l] for x in df.content]
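The reason the conversion helps is that `x in some_list` scans the list element by element, while `x in some_set` is a hash lookup. The snippet below is a rough, self-contained sketch of that difference on a single row. The 1k-item `keep_list`, the sample `row`, and the helper `filter_row` are all invented for illustration, and the actual timings will vary by machine.

```python
import timeit

keep_list = [f"item{i}" for i in range(1000)]  # hypothetical 1k-item list
keep_set = set(keep_list)                      # same items, hashed
row = ["item999", "missing", "item0", "zzz"]   # hypothetical row contents


def filter_row(items, allowed):
    """Return the elements of `items` that are present in `allowed`."""
    return [x for x in items if x in allowed]


# Both containers give the same answer; only the lookup cost differs.
assert filter_row(row, keep_list) == filter_row(row, keep_set)

t_list = timeit.timeit(lambda: filter_row(row, keep_list), number=2000)
t_set = timeit.timeit(lambda: filter_row(row, keep_set), number=2000)
print(f"list: {t_list:.4f}s  set: {t_set:.4f}s")
```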
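The full test code and results are below. I had to make some assumptions about the exact nature of the real dataset, but I think randomly generated lists of strings should be reasonably representative. Note that I excluded T Burgis's solution because I ran into an error with it; I may have done something wrong, but since they have already commented that W-B's solution is faster, I didn't try very hard to track down the cause. I should also note that, for consistency, every solution assigns its result to df['new'], whether or not the original answer did.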
import random
import string
import pandas as pd
def initial_setup():
    """
    Returns a 1m row x 1 column DataFrame, and a 992 element list of strings (all unique).
    """
    random.seed(1)
    keep = list(set([''.join(random.choices(string.ascii_lowercase, k=random.randint(1, 5))) for i in range(1250)]))
    content = [[''.join(random.choices(string.ascii_lowercase, k=random.randint(1, 5))) for i in range(5)] for j in range(1000000)]
    df = pd.DataFrame({'content': content})
    return df, keep

def jpp(df, L):
    df['new'] = [list(dict.fromkeys(x).keys() & L) for x in df['content']]

def wb(df, l):
    df['new'] = [[y for y in x if y in l] for x in df.content]

def jonathon(df, list1):
    df['new'] = [list(filter(lambda x:x in list1,i)) for i in df['content']]
Tests without converting the list to a set:
In [3]: df, keep = initial_setup()
...: %timeit jpp(df, keep)
...:
16.9 s ± 333 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [4]: df, keep = initial_setup()
...: %timeit wb(df, keep)
1min ± 612 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [5]: df, keep = initial_setup()
...: %timeit jonathon(df, keep)
1min 2s ± 1.26 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
Tests with the list converted to a set:
In [6]: df, keep = initial_setup()
...: %timeit jpp(df, set(keep))
...:
1.7 s ± 18.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [7]: df, keep = initial_setup()
...: %timeit wb(df, set(keep))
...:
689 ms ± 20.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [8]: df, keep = initial_setup()
...: %timeit jonathon(df, set(keep))
...:
1.26 s ± 10.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
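To put the two sets of timings side by side, the short calculation below divides each list-based mean by the corresponding set-based mean. The numbers are copied from the %timeit output above (with "1min" taken as 60 s and "1min 2s" as 62 s); the dictionary names are just labels for this sketch.

```python
# Mean runtimes in seconds, taken from the %timeit results above.
with_list = {"jpp": 16.9, "wb": 60.0, "jonathon": 62.0}
with_set = {"jpp": 1.7, "wb": 0.689, "jonathon": 1.26}

# Speedup factor from converting the list to a set, per solution.
speedups = {name: with_list[name] / with_set[name] for name in with_list}

for name, factor in speedups.items():
    print(f"{name}: about {factor:.1f}x faster with a set")
```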