Reduce function for Series

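Time: 2016-01-26 00:18:05

Tags: python performance pandas vectorization reduce

Is there an analog of reduce for a pandas Series?

For example, the analog of map is pd.Series.apply, but I can't find any analog of reduce.

My application is that I have a pandas Series of lists: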

>>> business["categories"].head()

0                      ['Doctors', 'Health & Medical']
1                                        ['Nightlife']
2                 ['Active Life', 'Mini Golf', 'Golf']
3    ['Shopping', 'Home Services', 'Internet Servic...
4    ['Bars', 'American (New)', 'Nightlife', 'Loung...
Name: categories, dtype: object

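I'd like to merge the lists in the Series together using reduce, like so:

from functools import reduce  # required on Python 3
categories = reduce(lambda l1, l2: l1 + l2, categories)

But this takes a horrific amount of time, because each l1 + l2 builds a new list, which is O(n) in Python, so the total cost grows quadratically with the number of lists. I was hoping pd.Series has a vectorized way to do this faster.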
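To make the cost concrete, here is a rough sketch (not a measurement from my real data) that replays what reduce with l1 + l2 does and counts how many elements get copied along the way. The sample data and the names lists and elements_copied are made up for illustration.

```python
# Synthetic stand-in for the Series of lists (hypothetical data).
lists = [['a', 'b'], ['c', 'd', 'e']] * 1000

result = []
elements_copied = 0
for lst in lists:
    # Same step that reduce(lambda l1, l2: l1 + l2, ...) performs:
    # "+" allocates a brand-new list and copies both operands into it.
    result = result + lst
    elements_copied += len(result)

print(len(result))        # size of the final flat list
print(elements_copied)    # total elements copied while building it
```

The second number is far larger than the first, which is where the time goes.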

4 answers:

Answer 0: (score: 17)

With itertools.chain() on the values

This could be faster:

from itertools import chain
categories = list(chain.from_iterable(categories.values))

Performance

from functools import reduce
from itertools import chain
import pandas as pd

categories = pd.Series([['a', 'b'], ['c', 'd', 'e']] * 1000)

%timeit list(chain.from_iterable(categories.values))
1000 loops, best of 3: 231 µs per loop

%timeit list(chain(*categories.values.flat))
1000 loops, best of 3: 237 µs per loop

%timeit reduce(lambda l1, l2: l1 + l2, categories)
100 loops, best of 3: 15.8 ms per loop

For this dataset, chain.from_iterable is about 68 times faster than reduce.
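For anyone without IPython's %timeit, the comparison can be reproduced with the standard timeit module. This is only a minimal sketch on the same synthetic Series; the repeat count of 20 is an arbitrary choice, and absolute timings will differ by machine and library version.

```python
from functools import reduce
from itertools import chain
from timeit import timeit

import pandas as pd

categories = pd.Series([['a', 'b'], ['c', 'd', 'e']] * 1000)

# Both approaches should produce the same flat list.
flat_chain = list(chain.from_iterable(categories.values))
flat_reduce = reduce(lambda l1, l2: l1 + l2, categories)
assert flat_chain == flat_reduce

runs = 20  # arbitrary repeat count for this sketch
t_chain = timeit(lambda: list(chain.from_iterable(categories.values)), number=runs)
t_reduce = timeit(lambda: reduce(lambda l1, l2: l1 + l2, categories), number=runs)

print(f"chain:  {t_chain / runs * 1e6:.0f} µs per call")
print(f"reduce: {t_reduce / runs * 1e6:.0f} µs per call")
```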

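Vectorization?

Vectorization works when you have native NumPy data types (pandas uses NumPy for its data, after all). Since we already have lists in the Series and want a list as the result, it is rather unlikely that vectorization will speed things up. The conversion between standard Python objects and pandas/NumPy data types will likely eat up all the performance you might gain from vectorization. I made an attempt to vectorize the algorithm in another answer.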
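A quick way to see why there is little for vectorization to work with: a Series of lists is stored with dtype object, meaning the underlying array just holds references to ordinary Python lists. The sketch below uses made-up data and hypothetical variable names.

```python
import numpy as np
import pandas as pd

numbers = pd.Series([1.0, 2.0, 3.0])               # native NumPy dtype
lists = pd.Series([['a', 'b'], ['c', 'd', 'e']])   # Python objects

print(numbers.dtype)           # float64
print(lists.dtype)             # object
print(type(lists.values[0]))   # <class 'list'>
```

Operations on the float64 Series can run in compiled loops, while anything touching the object Series has to go through the Python objects one by one.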

Answer 1: (score: 2)

Vectorized but slow

You can use NumPy's concatenate:

import numpy as np

list(np.concatenate(categories.values))

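Performance

But we already have lists, i.e. Python objects. So the vectorized version has to convert back and forth between Python objects and NumPy data types, which makes it slow: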

from itertools import chain
categories = pd.Series([['a', 'b'], ['c', 'd', 'e']] * 1000)

%timeit list(np.concatenate(categories.values))
100 loops, best of 3: 7.66 ms per loop

%timeit np.concatenate(categories.values)
100 loops, best of 3: 5.33 ms per loop

%timeit list(chain.from_iterable(categories.values))
1000 loops, best of 3: 231 µs per loop
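The round trip can be made visible with a small sketch (made-up data, hypothetical names): np.concatenate first turns every list into a NumPy string array, and list() then hands back NumPy scalars rather than the original Python strings.

```python
from itertools import chain

import numpy as np
import pandas as pd

categories = pd.Series([['a', 'b'], ['c', 'd', 'e']] * 2)

arr = np.concatenate(categories.values)
print(arr.dtype.kind)      # U, i.e. a NumPy unicode array

as_list = list(arr)
print(type(as_list[0]))    # <class 'numpy.str_'>

# The contents still match what chain.from_iterable produces.
print(as_list == list(chain.from_iterable(categories.values)))  # True
```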

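Answer 2: (score: 0)

You can try business["categories"].str.join(''), but I am guessing that pandas uses Python's string functions under the hood. I doubt you can do better than what Python already offers.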
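One thing worth checking before relying on this: as far as I can tell, Series.str.join works element by element. It joins the items inside each row and returns a Series of the same length, so it does not merge the rows into one list. A small sketch with made-up data:

```python
import pandas as pd

categories = pd.Series([['a', 'b'], ['c', 'd', 'e']])

# Each row's list is joined separately; the rows are not merged.
joined = categories.str.join('')
print(joined.tolist())   # ['ab', 'cde']
print(len(joined))       # 2
```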

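Answer 3: (score: 0)

I used "".join(business["categories"])

It is much faster than business["categories"].str.join(''), but still about 4 times slower than the itertools.chain method. I preferred it because it is more readable and needs no import.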
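Note that "".join only accepts strings, so this works when each element of the Series is itself a string (the quoted output in the question hints that this may be the case, but that is an assumption). With real lists as elements it raises a TypeError. A minimal sketch with made-up data:

```python
import pandas as pd

# Elements are strings: join concatenates them into one string.
as_strings = pd.Series(["ab", "cde"])
joined = "".join(as_strings)
print(joined)   # abcde

# Elements are lists: str.join refuses non-string items.
as_lists = pd.Series([['a', 'b'], ['c', 'd', 'e']])
try:
    "".join(as_lists)
    raised = False
except TypeError:
    raised = True
print(raised)   # True
```

In the string case the result is one long string, not a list of categories, so it would still need parsing afterwards.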