I am searching for a substring, or several substrings, in a DataFrame with 4 million rows.
df[df.col.str.contains('Donald',case=True,na=False)]
or
df[df.col.str.contains('Donald|Trump|Dump',case=True,na=False)]
The DataFrame (df) looks like this (with 4 million string rows):
df = pd.DataFrame({'col': ["very definition of the American success story, continually setting the standards of excellence in business, real estate and entertainment.",
"The myriad vulgarities of Donald Trump—examples of which are retailed daily on Web sites and front pages these days—are not news to those of us who have",
"While a fearful nation watched the terrorists attack again, striking the cafés of Paris and the conference rooms of San Bernardino"]})
Are there any tips to make this string search faster? For example: sorting the DataFrame first, some way of indexing it, changing the column name to a number, dropping na=False from the query, and so on? Even a millisecond speed-up would be very useful!
Answer 0 (score: 4)
If the number of substrings is small, it may be faster to search for them one at a time, because then you can pass the regex=False argument to contains, which speeds it up.
On a sample DataFrame of about 6000 rows, I tested it with two sample substrings, and blah.contains("foo", regex=False) | blah.contains("bar", regex=False) was about twice as fast as blah.contains("foo|bar"). You will have to test it with your data to see how it scales.
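For several substrings, a minimal sketch of this approach (the loop and the mask variable are mine, not from the answer; substrings and df are taken from the question):

import pandas as pd

substrings = ['Donald', 'Trump', 'Dump']
# OR together one literal (regex=False) search per substring.
mask = pd.Series(False, index=df.index)
for s in substrings:
    mask |= df['col'].str.contains(s, regex=False, na=False)
result = df[mask]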
Answer 1 (score: 2)
You can convert it to a list. Searching a list appears to be faster than applying the string method to the Series.

Sample code:
import pandas as pd
df = pd.DataFrame({'col': ["very definition of the American success story, continually setting the standards of excellence in business, real estate and entertainment.",
"The myriad vulgarities of Donald Trump—examples of which are retailed daily on Web sites and front pages these days—are not news to those of us who have",
"While a fearful nation watched the terrorists attack again, striking the cafés of Paris and the conference rooms of San Bernardino"]})
def first_way():
    # Apply the vectorized string method to the Series.
    df["new"] = pd.Series(df["col"].str.contains('Donald', case=True, na=False))
    return None

print("First_way: ")
%timeit for x in range(10): first_way()
print(df)
df = pd.DataFrame({'col': ["very definition of the American success story, continually setting the standards of excellence in business, real estate and entertainment.",
"The myriad vulgarities of Donald Trump—examples of which are retailed daily on Web sites and front pages these days—are not news to those of us who have",
"While a fearful nation watched the terrorists attack again, striking the cafés of Paris and the conference rooms of San Bernardino"]})
def second_way():
    # Convert to a plain Python list and use the `in` operator.
    listed = df["col"].tolist()
    df["new"] = ["Donald" in n for n in listed]
    return None

print("Second way: ")
%timeit for x in range(10): second_way()
print(df)
Results:
First_way:
100 loops, best of 3: 2.77 ms per loop
col new
0 very definition of the American success story,... False
1 The myriad vulgarities of Donald Trump—example... True
2 While a fearful nation watched the terrorists ... False
Second way:
1000 loops, best of 3: 1.79 ms per loop
col new
0 very definition of the American success story,... False
1 The myriad vulgarities of Donald Trump—example... True
2 While a fearful nation watched the terrorists ... False
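If you need several substrings, the same list-based idea extends with any(). This generalization is my sketch, not part of the original answer:

substrings = ["Donald", "Trump", "Dump"]
listed = df["col"].tolist()
# True if any of the literal substrings occurs in the row's text.
df["new"] = [any(s in n for s in substrings) for n in listed]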
Answer 2 (score: 0)
BrenBarn's answer above helped me solve my problem. I'm just writing down my problem and how it was fixed, in the hope that it helps someone :)

My data has about 2000 rows, mostly text. Previously, I searched with a case-insensitive regex, like this:
import re

reg_exp = ''.join(['(?=.*%s)' % (i) for i in search_list])
series_to_search = data_new.iloc[:,title_column_index] + ' : ' + data_new.iloc[:,description_column_index]
data_new = data_new[series_to_search.str.contains(reg_exp, flags=re.IGNORECASE)]
For a search list containing ['exception', 'VE20'], this code took 58.710898 seconds. When I replaced it with a simple for loop, it took only 0.055304 seconds. An improvement of about 1,061.60x!
for search in search_list:
    series_to_search = data_new.iloc[:,title_column_index] + ' : ' + data_new.iloc[:,description_column_index]
    data_new = data_new[series_to_search.str.lower().str.contains(search.lower())]
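A small variation on the same loop, as a sketch (not from the original answer): build the combined, lower-cased series once, accumulate a single boolean mask, and pass regex=False since the search terms are literal strings.

import pandas as pd

# Combine and lower-case the two text columns once, outside the loop.
series_to_search = (data_new.iloc[:, title_column_index] + ' : '
                    + data_new.iloc[:, description_column_index]).str.lower()
# AND together one literal (regex=False) match per search term.
mask = pd.Series(True, index=data_new.index)
for search in search_list:
    mask &= series_to_search.str.contains(search.lower(), regex=False)
data_new = data_new[mask]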