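I have a very large dataset from Twitter. I'm trying to figure out how to do the numpy equivalent of the Python filtering shown below. The environment is the Python interpreter.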
>>> tweets = [['buhari si good'], ['atiku is great'], ['buhari nfd sdfa atiku'],
...           ['is nice man that buhari']]
>>> filter(lambda x: 'buhari' in x[0].lower(), tweets)
[['buhari si good'], ['buhari nfd sdfa atiku'], ['is nice man that buhari']]
I tried boolean indexing as shown below, but the result is an empty array:
>>> tweet_arr = np.array([['buhari si good'], ['atiku is great'], ['buhari nfd sdfa atiku'], ['is nice man that buhari']])
>>> flat_tweets = tweet_arr[:, 0]
>>> flat_tweets
array(['buhari si good', 'atiku is great', 'buhari nfd sdfa atiku',
       'is nice man that buhari'], dtype='|S23')
>>> flat_tweets['buhari' in flat_tweets]
array([], shape=(0, 4), dtype='|S23')
I'd like to know how to filter strings in a numpy array the same way I can easily filter even numbers here:
>>> arr = np.arange(15).reshape((15,1))
>>> arr
array([[ 0],
       [ 1],
       [ 2],
       [ 3],
       [ 4],
       [ 5],
       [ 6],
       [ 7],
       [ 8],
       [ 9],
       [10],
       [11],
       [12],
       [13],
       [14]])
>>> arr[:][arr % 2 == 0]
array([ 0,  2,  4,  6,  8, 10, 12, 14])
Thanks
Answer 0 (score: 2)
If you want to stick to a purely NumPy-based solution, you could do
from numpy.core.defchararray import find, lower
tweet_arr[find(lower(tweet_arr), 'buhari') != -1]
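As for why the attempt in the question returned an empty array: `'buhari' in flat_tweets` is evaluated by Python into a single `bool` (here `False`, since no element equals `'buhari'` exactly), and indexing with that scalar selects nothing. What boolean indexing needs is an element-wise mask. The following is a minimal sketch of that difference, assuming a NumPy version that exposes the same functions under the `np.char` namespace; the names `scalar_test`, `mask` and `filtered` are only illustrative.

```python
import numpy as np

tweet_arr = np.array([['buhari si good'], ['atiku is great'],
                      ['buhari nfd sdfa atiku'], ['is nice man that buhari']])
flat_tweets = tweet_arr[:, 0]

# `in` checks whether any whole element equals 'buhari' and
# collapses the answer to a single Python bool, not a mask.
scalar_test = 'buhari' in flat_tweets

# np.char.find returns, per element, the index of the substring or -1.
# Comparing against -1 gives one boolean per tweet.
mask = np.char.find(np.char.lower(flat_tweets), 'buhari') != -1

# Boolean indexing with the element-wise mask keeps the matching tweets.
filtered = flat_tweets[mask]
print(filtered)
```

The mask has the same length as `flat_tweets`, which is the same shape of condition that `arr % 2 == 0` produces in the even-number example from the question.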
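You mention in the comments that performance is what you're after here, so it should be noted that this appears to be considerably slower than the solution you came up with yourself: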
In [33]: large_arr = np.repeat(tweet_arr, 10000)
In [36]: %timeit large_arr[find(lower(large_arr), 'buhari') != -1]
54.6 ms ± 765 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [43]: %timeit list(filter(lambda x: 'buhari' in x.lower(), large_arr))
21.2 ms ± 219 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In fact, a plain list comprehension beats both approaches:
In [44]: %timeit [x for x in large_arr if 'buhari' in x.lower()]
18.5 ms ± 102 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
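To reproduce this comparison outside IPython, something along the following lines should work with the standard `timeit` module. This is only a rough sketch: the helper names `numpy_filter` and `comprehension_filter` are made up for illustration, `np.char` is assumed to be available as the public home of `find` and `lower`, and the absolute timings will differ by machine, Python version and NumPy version.

```python
import timeit

import numpy as np

tweet_arr = np.array(['buhari si good', 'atiku is great',
                      'buhari nfd sdfa atiku', 'is nice man that buhari'])
large_arr = np.repeat(tweet_arr, 10000)  # 40000 tweets


def numpy_filter(arr, word):
    """Vectorised substring match via np.char (hypothetical helper)."""
    return arr[np.char.find(np.char.lower(arr), word) != -1]


def comprehension_filter(arr, word):
    """Plain Python loop over the array elements (hypothetical helper)."""
    return [x for x in arr if word in x.lower()]


numpy_result = numpy_filter(large_arr, 'buhari')
list_result = comprehension_filter(large_arr, 'buhari')

# Time a handful of runs of each approach and report the per-run average.
runs = 5
for name, func in [('numpy', numpy_filter), ('comprehension', comprehension_filter)]:
    seconds = timeit.timeit(lambda: func(large_arr, 'buhari'), number=runs)
    print(f'{name}: {seconds / runs * 1000:.1f} ms per run')
```

Note that the two helpers return different types (an array versus a list), so if the result feeds back into NumPy code the conversion cost may need to be included in the comparison.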