I have a DataFrame with more than 1M rows. I want to select all rows where a certain column contains a certain substring:
matching = df['col2'].str.contains('substr', case=True, regex=False)
rows = df[matching].col1.drop_duplicates()
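But this selection is slow and I want to speed it up. Say I only need the first n results. Is there a way to stop matching once n results have been found? I tried: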
matching = df['col2'].str.contains('substr', case=True, regex=False).head(n)
and
matching = df['col2'].str.contains('substr', case=True, regex=False).sample(n)
But neither is any faster. The second statement, the boolean indexing, is very fast. How can I speed up the first statement?
Answer 0 (score: 2)
Believe it or not, the .str accessor is slow. You can use a list comprehension, which performs better.
import numpy as np
import pandas as pd
df = pd.DataFrame({'col2':np.random.choice(['substring','midstring','nostring','substrate'],100000)})
Check that both approaches give the same result:
all(df['col2'].str.contains('substr', case=True, regex=False) ==
pd.Series(['substr' in i for i in df['col2']]))
Output:
True
Timings:
%timeit df['col2'].str.contains('substr', case=True, regex=False)
10 loops, best of 3: 37.9 ms per loop
versus
%timeit pd.Series(['substr' in i for i in df['col2']])
100 loops, best of 3: 19.1 ms per loop
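One caveat with the list comprehension: `'substr' in i` raises a TypeError as soon as the column holds a non-string value such as NaN, and a bare `pd.Series([...])` gets a fresh 0..N-1 index rather than the index of the original column. A minimal sketch of a guarded version follows; the sample Series `s` is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with a missing value mixed in.
s = pd.Series(['substring', np.nan, 'nostring', 'substrate'], index=[10, 11, 12, 13])

# isinstance guards against non-string entries (they count as "no match"),
# and index=s.index keeps the mask aligned with the original column.
mask = pd.Series([isinstance(v, str) and 'substr' in v for v in s], index=s.index)

# The .str accessor gives the same answer when na=False is passed.
expected = s.str.contains('substr', case=True, regex=False, na=False)

print(mask.tolist() == expected.tolist())
```

The extra `isinstance` check costs a little time per element, so it is worth re-running the timings on your own data before relying on the speedup.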
Answer 1 (score: 1)
You can use:
matching = df['col2'].head(n).str.contains('substr', case=True, regex=False)
rows = df['col1'].head(n)[matching==True]
However, this solution returns the matches found within the first n rows, not the first n matches.
If you really need the first n matches, use:
rows = df['col1'][df['col2'].str.contains("substr")==True].head(n)
But this option is of course slower.
Inspired by @ScottBoston's answer, you can get a complete and faster solution with:
rows = df['col1'][pd.Series(['substr' in i for i in df['col2']])==True].head(n)
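In the timings below this is slower than the head(n) shortcut but faster than the .str.contains version that builds the whole result, and it still returns the first n matches.
With the following test code we can see the speed and result of each solution: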
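None of the expressions above actually stops scanning once n matches are found; they all build a mask for the whole column first. A rough sketch of a true early exit, using a generator with `itertools.islice`, is shown here. The helper name `first_n_matches` and its parameters are hypothetical, and the small DataFrame just repeats the test data three times.

```python
from itertools import islice

import pandas as pd


def first_n_matches(df, n, substr, text_col='col2', value_col='col1'):
    """Return value_col for the first n rows whose text_col contains substr."""
    values = df[text_col].to_numpy()
    # Lazy generator: positions are produced one at a time, so the scan
    # stops as soon as islice has taken n of them.
    hits = (pos for pos, text in enumerate(values)
            if isinstance(text, str) and substr in text)
    positions = list(islice(hits, n))
    return df[value_col].iloc[positions]


df = pd.DataFrame({
    'col1': ["Result", "from", "first", "column", "for", "this", "matching", "test", "end"] * 3,
    'col2': ["This", "is", "a", "test", "has substr", "also has substr", "end", "of", "test"] * 3,
})
rows = first_n_matches(df, 3, 'substr')
print(rows)
```

How much this helps depends on where the matches sit: if they appear early, only a small part of the column is visited, while in the worst case the whole column is still scanned in a Python-level loop.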
import pandas as pd
import time
n = 10
a = ["Result", "from", "first", "column", "for", "this", "matching", "test", "end"]
b = ["This", "is", "a", "test", "has substr", "also has substr", "end", "of", "test"]
col1 = a*1000000
col2 = b*1000000
df = pd.DataFrame({"col1":col1,"col2":col2})
# Original option
start_time = time.time()
matching = df['col2'].str.contains('substr', case=True, regex=False)
rows = df[matching].col1.drop_duplicates()
print("--- %s seconds ---" % (time.time() - start_time))
# Faster option
start_time = time.time()
matching_fast = df['col2'].head(n).str.contains('substr', case=True, regex=False)
rows_fast = df['col1'].head(n)[matching_fast==True]
print("--- %s seconds for fast solution ---" % (time.time() - start_time))
# Other option
start_time = time.time()
rows_other = df['col1'][df['col2'].str.contains("substr")==True].head(n)
print("--- %s seconds for other solution ---" % (time.time() - start_time))
# Complete option
start_time = time.time()
rows_complete = df['col1'][pd.Series(['substr' in i for i in df['col2']])==True].head(n)
print("--- %s seconds for complete solution ---" % (time.time() - start_time))
This outputs:
>>>
--- 2.33899998665 seconds ---
--- 0.302999973297 seconds for fast solution ---
--- 4.56700015068 seconds for other solution ---
--- 1.61599993706 seconds for complete solution ---
The resulting Series are:
>>> rows
4 for
5 this
Name: col1, dtype: object
>>> rows_fast
4 for
5 this
Name: col1, dtype: object
>>> rows_other
4 for
5 this
13 for
14 this
22 for
23 this
31 for
32 this
40 for
41 this
Name: col1, dtype: object
>>> rows_complete
4 for
5 this
13 for
14 this
22 for
23 this
31 for
32 this
40 for
41 this
Name: col1, dtype: object
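Another way to get the first n matches without scanning everything is to run the vectorized `str.contains` on one block of rows at a time and stop once enough matches have accumulated. This is only a sketch under assumptions: the function name `first_n_matches_chunked` and the `chunk_size` default are made up, and the column names follow the test code above.

```python
import pandas as pd


def first_n_matches_chunked(df, n, substr, chunk_size=100_000):
    """Scan df in row chunks and stop once n matches have been collected."""
    pieces = []
    found = 0
    for start in range(0, len(df), chunk_size):
        chunk = df.iloc[start:start + chunk_size]
        mask = chunk['col2'].str.contains(substr, case=True, regex=False, na=False)
        hits = chunk['col1'][mask.to_numpy()]
        if len(hits):
            pieces.append(hits)
            found += len(hits)
        if found >= n:
            break  # later chunks are never touched
    if not pieces:
        return df['col1'].iloc[:0]
    return pd.concat(pieces).head(n)


df = pd.DataFrame({
    'col1': ["Result", "from", "first", "column", "for", "this", "matching", "test", "end"] * 3,
    'col2': ["This", "is", "a", "test", "has substr", "also has substr", "end", "of", "test"] * 3,
})
rows_chunked = first_n_matches_chunked(df, 3, 'substr', chunk_size=10)
print(rows_chunked)
```

The chunk size is a trade-off: small chunks stop sooner but pay more per-call overhead, so it is worth timing a few values on the real data.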