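Is there an analog of reduce for a pandas Series?
For example, the analog of map is pd.Series.apply, but I can't find any analog of reduce.
My use case is that I have a pandas Series of lists: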
>>> business["categories"].head()
0 ['Doctors', 'Health & Medical']
1 ['Nightlife']
2 ['Active Life', 'Mini Golf', 'Golf']
3 ['Shopping', 'Home Services', 'Internet Servic...
4 ['Bars', 'American (New)', 'Nightlife', 'Loung...
Name: categories, dtype: object
I'd like to merge the lists in the Series together using reduce, like so:
categories = reduce(lambda l1, l2: l1 + l2, categories)
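But this takes a horrific amount of time, because merging two lists together is an O(n) operation in Python. I'm hoping that pd.Series has a vectorized way to do this faster.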
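To make the cost described above concrete, here is a minimal sketch in plain Python. The names `lists`, `copied` and `flat` are made up for illustration, and the small list of lists is only a stand-in for the Series in the question. It counts how many elements `l1 + l2` has to copy across all reduce steps, and compares the result with extending a single list in place.

```python
from functools import reduce

# Hypothetical stand-in for the Series of lists in the question.
lists = [['a', 'b'], ['c', 'd', 'e']] * 4

# reduce with + builds a brand-new list at every step,
# copying everything accumulated so far plus the next list.
merged = reduce(lambda l1, l2: l1 + l2, lists)

# Count the elements copied by those intermediate concatenations.
copied = 0
acc_len = len(lists[0])      # the first list is the initial accumulator
for lst in lists[1:]:
    acc_len += len(lst)      # length of the new list created by l1 + l2
    copied += acc_len        # every element of that new list is a copy

# Extending one list in place only touches each element once.
flat = []
for lst in lists:
    flat.extend(lst)

print(len(merged))  # 20
print(copied)       # 86
```

For only 20 result elements the reduce version already copies 86, and the gap widens as the number of lists grows, which is consistent with the slowdown reported in the question.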
Answer 0: (score: 17)
Using itertools.chain() on the values could be faster:
from itertools import chain
categories = list(chain.from_iterable(categories.values))
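As a self-contained illustration of the chaining approach, the sketch below builds a small Series from the first three rows shown in the question (the variable name `flat` is an assumption for this example) and flattens it into a single list.

```python
from itertools import chain

import pandas as pd

# Small sample modeled on the first rows shown in the question.
categories = pd.Series([
    ['Doctors', 'Health & Medical'],
    ['Nightlife'],
    ['Active Life', 'Mini Golf', 'Golf'],
])

# chain.from_iterable walks each inner list lazily, so no
# intermediate lists are built before the final list() call.
flat = list(chain.from_iterable(categories.values))

print(flat)
# ['Doctors', 'Health & Medical', 'Nightlife', 'Active Life', 'Mini Golf', 'Golf']
```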
Timings on a sample data set:
from functools import reduce
from itertools import chain
categories = pd.Series([['a', 'b'], ['c', 'd', 'e']] * 1000)
%timeit list(chain.from_iterable(categories.values))
1000 loops, best of 3: 231 µs per loop
%timeit list(chain(*categories.values.flat))
1000 loops, best of 3: 237 µs per loop
%timeit reduce(lambda l1, l2: l1 + l2, categories)
100 loops, best of 3: 15.8 ms per loop
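For this data set, chaining is about 68 times faster than reduce.
Vectorization works when you have native NumPy data types (pandas uses NumPy for its data, after all). Since we already have lists in the Series and want a list as the result, it is rather unlikely that vectorization will speed things up. The conversion between standard Python objects and pandas/NumPy data types will likely eat up all the performance you might gain from vectorization. I made one attempt to vectorize the algorithm in another answer.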
Answer 1: (score: 2)
You can use NumPy's concatenate:
import numpy as np
list(np.concatenate(categories.values))
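But we already have lists, i.e. Python objects. So the vectorization has to switch back and forth between Python objects and NumPy data types. This makes things slow: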
categories = pd.Series([['a', 'b'], ['c', 'd', 'e']] * 1000)
%timeit list(np.concatenate(categories.values))
100 loops, best of 3: 7.66 ms per loop
%timeit np.concatenate(categories.values)
100 loops, best of 3: 5.33 ms per loop
%timeit list(chain.from_iterable(categories.values))
1000 loops, best of 3: 231 µs per loop
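The timings above use IPython's %timeit magic. For anyone who wants to repeat the comparison in a plain Python script, here is a rough sketch using the standard timeit module. The `candidates` and `timings` names are invented for this example, and the absolute numbers will differ by machine and library version.

```python
import timeit
from functools import reduce
from itertools import chain

import numpy as np
import pandas as pd

# Same sample data as in the timings above.
categories = pd.Series([['a', 'b'], ['c', 'd', 'e']] * 1000)

# The three approaches discussed in this thread.
candidates = {
    "chain.from_iterable": lambda: list(chain.from_iterable(categories.values)),
    "np.concatenate": lambda: list(np.concatenate(categories.values)),
    "reduce": lambda: reduce(lambda l1, l2: l1 + l2, categories),
}

timings = {}
for name, func in candidates.items():
    # Best of 3 repeats, 10 calls each, reported as seconds per call.
    timings[name] = min(timeit.repeat(func, number=10, repeat=3)) / 10
    print(f"{name}: {timings[name] * 1e6:.0f} µs per call")
```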
Answer 2: (score: 0)
You could try business["categories"].str.join(''), but I'm guessing that pandas uses Python's string functions. I doubt you can do better than what Python already offers.
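Answer 3: (score: 0)
I used "".join(business["categories"]).
It is much faster than business["categories"].str.join(''), but still 4 times slower than the itertools.chain method. I prefer it because it is more readable and needs no import.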