I have a DataFrame and I want to keep only the rows where a given column contains any of a list of substrings.
I use:
df_res = pd.DataFrame()
for i in substr:
    res = df[df['event_address'].str.contains(i)]
df looks like:
member_id,event_address,event_time,event_duration
g1497o1ofm5a1963,fotki.yandex.ru/users/atanusha/albums,2015-05-01 00:00:05,8
g1497o1ofm5a1963,9829192.ru/apple-iphone.html,2015-05-01 00:00:15,2
g1497o1ofm5a1963,fotki.yandex.ru/users/atanusha/album/165150?&p=3,2015-05-01 00:00:17,2
g1497o1ofm5a1963,fotki.yandex.ru/tags/%D0%B1%D0%BE%D1%81%D0%B8%D0%BA%D0%BE%D0%BC?text=%D0%B1%D0%BE%D1%81%D0%B8%D0%BA%D0%BE%D0%BC&search_author=utpaladev&&p=2,2015-05-01 00:01:31,10
g1497o1ofm5a1963,3gmaster.net,2015-05-01 00:01:41,6
g1497o1ofm5a1963,fotki.yandex.ru/search.xml?text=%D0%B1%D0%BE%D1%81%D0%B8%D0%BA%D0%BE%D0%BC&&p=2,2015-05-01 00:02:01,6
g1497o1ofm5a1963,fotki.yandex.ru/search.xml?text=%D0%B1%D0%BE%D1%81%D0%B8%D0%BA%D0%BE%D0%BC&search_author=Sunny-Fanny&,2015-05-01 00:02:31,2
g1497o1ofm5a1963,fotki.9829192.ru/apple-iphone.html,2015-05-01 00:03:25,6
and substr is:
123.ru/gadgets/communicators
320-8080.ru/mobilephones
3gmaster.net
3-q.ru/products/smartfony/s
9829192.ru/apple-iphone.html
9829192.ru/index.php?cat=1
acer.com/ac/ru/ru/content/group/smartphones
aj.ru
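This code gives me the desired result, but it takes a long time.
I also tried passing the column directly (substr is built with substr = urls.url.values.tolist()).
I tried:
res = df[df['event_address'].str.contains(urls.url)]
but it raises:
TypeError: 'Series' objects are mutable, thus they cannot be hashed
Is there a way to make this faster, or am I doing something wrong?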
Answer 0 (score: 2)
Do this:
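def check_exists(x):
    # Return True as soon as any substring is found in x
    for i in substr:
        if i in x:
            return True
    return False
# .ix was removed in pandas 1.0; .loc with a boolean mask selects the same rows
df2 = df.loc[df.event_address.map(check_exists)]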
Or, if you prefer a one-liner:
df.loc[df.event_address.map(lambda x: any(i in x for i in substr))]
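For reference, here is a minimal self-contained sketch of the approach above. The DataFrame holds only a few rows copied from the question, and substr is shortened to three entries; both are reduced for illustration, not the full data.

```python
import pandas as pd

# A few rows taken from the question's data (shortened for illustration)
df = pd.DataFrame({
    "member_id": ["g1497o1ofm5a1963"] * 4,
    "event_address": [
        "fotki.yandex.ru/users/atanusha/albums",
        "9829192.ru/apple-iphone.html",
        "3gmaster.net",
        "fotki.9829192.ru/apple-iphone.html",
    ],
    "event_duration": [8, 2, 6, 6],
})
substr = ["3gmaster.net", "9829192.ru/apple-iphone.html", "aj.ru"]


def check_exists(x):
    # True if any of the substrings occurs literally in x
    return any(s in x for s in substr)


# Boolean mask, one entry per row, then select the matching rows
mask = df["event_address"].map(check_exists)
df2 = df.loc[mask]
print(df2["event_address"].tolist())
```

Because this uses Python's in operator, the substrings are matched literally, so characters such as . and ? have no special meaning.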
Answer 1 (score: 1)
If you need a faster solution, I think you need to join the values with | and pass the resulting pattern to str.contains:
res = df[df['event_address'].str.contains('|'.join(urls.url))]
print (res)
member_id event_address event_time \
1 g1497o1ofm5a1963 9829192.ru/apple-iphone.html 2015-05-01 00:00:15
4 g1497o1ofm5a1963 3gmaster.net 2015-05-01 00:01:41
7 g1497o1ofm5a1963 fotki.9829192.ru/apple-iphone.html 2015-05-01 00:03:25
event_duration
1 2
4 6
7 6
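One caveat with the joined pattern: str.contains treats it as a regular expression by default, so characters like . and ? in the URLs act as metacharacters. The sketch below shows how that can differ from a literal match and how re.escape avoids it. The addresses Series is made up for illustration (the second entry is a deliberately wrong address), and the names raw_pattern and safe_pattern are mine, not from the answer.

```python
import re
import pandas as pd

# Hypothetical addresses chosen to expose the difference
addresses = pd.Series([
    "9829192.ru/index.php?cat=1",    # literal match for the first substring
    "9829192xru/apple-iphone.html",  # not a literal match for either substring
    "3gmaster.net",
])
substr = ["9829192.ru/index.php?cat=1", "9829192.ru/apple-iphone.html"]

# Joined as-is: "." matches any character and "?" makes the preceding "p" optional
raw_pattern = "|".join(substr)
# Escaped first: every substring is matched literally
safe_pattern = "|".join(re.escape(s) for s in substr)

raw_mask = addresses.str.contains(raw_pattern)
safe_mask = addresses.str.contains(safe_pattern)

print(raw_mask.tolist())   # [False, True, False]
print(safe_mask.tolist())  # [True, False, False]
```

With the sample data in the question the unescaped pattern happens to give the expected rows, so escaping only matters if the substring list contains entries like the ones above.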
Another solution, using a list comprehension:
res = df[df['event_address'].apply(lambda x: any([n in x for n in urls.url.tolist()]))]
print (res)
member_id event_address event_time \
1 g1497o1ofm5a1963 9829192.ru/apple-iphone.html 2015-05-01 00:00:15
4 g1497o1ofm5a1963 3gmaster.net 2015-05-01 00:01:41
7 g1497o1ofm5a1963 fotki.9829192.ru/apple-iphone.html 2015-05-01 00:03:25
event_duration
1 2
4 6
7 6
Timings:
#[8000 rows x 4 columns]
df = pd.concat([df]*1000).reset_index(drop=True)
In [68]: %timeit (df[df['event_address'].str.contains('|'.join(urls.url))])
100 loops, best of 3: 12 ms per loop
In [69]: %timeit (df.ix[df.event_address.map(check_exists)])
10 loops, best of 3: 155 ms per loop
In [70]: %timeit (df.ix[df.event_address.map(lambda x: any([True for i in urls.url.tolist() if i in x]))])
10 loops, best of 3: 163 ms per loop
In [71]: %timeit (df[df['event_address'].apply(lambda x: any([n in x for n in urls.url.tolist()] ))])
10 loops, best of 3: 174 ms per loop
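The %timeit lines above need IPython. As a rough sketch, the same kind of comparison can be run as a plain script with the standard timeit module. The data here is a small sample from the question repeated 2000 times, the function names by_regex and by_map are hypothetical, and the absolute numbers will depend on the machine and pandas version.

```python
import re
import timeit
import pandas as pd

# Sample addresses from the question, repeated to get 8000 rows
df = pd.DataFrame({"event_address": [
    "fotki.yandex.ru/users/atanusha/albums",
    "9829192.ru/apple-iphone.html",
    "3gmaster.net",
    "fotki.9829192.ru/apple-iphone.html",
] * 2000})
substr = ["3gmaster.net", "9829192.ru/apple-iphone.html", "aj.ru"]
pattern = "|".join(re.escape(s) for s in substr)


def by_regex():
    # Vectorized regex search over the whole column
    return df[df["event_address"].str.contains(pattern)]


def by_map():
    # Python-level substring check applied row by row
    return df[df["event_address"].map(lambda x: any(s in x for s in substr))]


# Check that both approaches select the same rows before timing them
assert by_regex().equals(by_map())

for fn in (by_regex, by_map):
    seconds = timeit.timeit(fn, number=10) / 10
    print(f"{fn.__name__}: {seconds * 1000:.2f} ms per call")
```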