Question

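Is there an analogue of reduce for a pandas Series?

For example, the analogue of map is pd.Series.apply, but I can't find any analogue of reduce.

My use case is that I have a pandas Series of lists: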

>>> business["categories"].head()

0                      ['Doctors', 'Health & Medical']
1                                        ['Nightlife']
2                 ['Active Life', 'Mini Golf', 'Golf']
3    ['Shopping', 'Home Services', 'Internet Servic...
4    ['Bars', 'American (New)', 'Nightlife', 'Loung...
Name: categories, dtype: object

I'd like to merge the lists in the Series together using reduce, like so:

categories = reduce(lambda l1, l2: l1 + l2, categories)

But this takes a horrific amount of time, because merging two lists together takes O(n) time in Python. I'm hoping pd.Series has a vectorized way to do this faster.
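To make the cost concrete, here is a minimal sketch that counts how many list elements get copied when lists are merged left to right with +, compared with a single pass over all elements. The helper names copies_with_reduce and copies_with_chain are hypothetical and exist only for this illustration; the counts assume that l1 + l2 always builds a new list of len(l1) + len(l2) elements.

```python
def copies_with_reduce(lengths):
    """Count elements copied when lists of these lengths are merged with + from left to right."""
    copied = 0
    accumulated = 0
    for i, n in enumerate(lengths):
        if i == 0:
            # reduce() without an initial value uses the first list as the accumulator.
            accumulated = n
            continue
        accumulated += n
        # l1 + l2 allocates a new list holding every element merged so far.
        copied += accumulated
    return copied


def copies_with_chain(lengths):
    """A single-pass flatten touches each element once."""
    return sum(lengths)


lengths = [2, 3] * 1000  # same shape as lists of 2 and 3 items repeated 1000 times
print(copies_with_reduce(lengths), copies_with_chain(lengths))
```

The first count grows roughly with the square of the number of lists, the second only linearly, which is why the gap widens as the Series gets longer.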

Answer 1

Use itertools.chain() on the values

This could be faster:

from itertools import chain
categories = list(chain.from_iterable(categories.values))

Performance

from functools import reduce
from itertools import chain

categories = pd.Series([['a', 'b'], ['c', 'd', 'e']] * 1000)

%timeit list(chain.from_iterable(categories.values))
1000 loops, best of 3: 231 µs per loop

%timeit list(chain(*categories.values.flat))
1000 loops, best of 3: 237 µs per loop

%timeit reduce(lambda l1, l2: l1 + l2, categories)
100 loops, best of 3: 15.8 ms per loop

For this data set, chaining is about 68 times faster than reduce.
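For readers without IPython, the comparison can be approximated with the standard timeit module. This is only a sketch: a plain list of lists stands in for categories.values (an assumption, so pandas is not needed), the function names flatten_chain and flatten_reduce are made up for the example, and the timings will differ from the numbers quoted in this answer depending on machine and Python version.

```python
import timeit
from functools import reduce
from itertools import chain

# Plain list of lists standing in for categories.values (assumption).
nested = [['a', 'b'], ['c', 'd', 'e']] * 1000


def flatten_chain(lists):
    """Flatten in one pass; each element is visited once."""
    return list(chain.from_iterable(lists))


def flatten_reduce(lists):
    """Flatten by repeated concatenation; every step copies the accumulated list."""
    return reduce(lambda l1, l2: l1 + l2, lists)


# Both approaches must produce the same flat list.
assert flatten_chain(nested) == flatten_reduce(nested)

if __name__ == "__main__":
    runs = 10
    for fn in (flatten_chain, flatten_reduce):
        seconds = timeit.timeit(lambda: fn(nested), number=runs)
        print(f"{fn.__name__}: {seconds / runs * 1e3:.3f} ms per call")
```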

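Vectorization?

Vectorization works when you have native NumPy data types (pandas uses NumPy for its data, after all). Since we already have lists in the Series and want a list as the result, it is rather unlikely that vectorization will speed things up. The conversion between standard Python objects and pandas/NumPy data types will likely eat up all the performance you might gain from vectorization. I made one attempt to vectorize the algorithm in another answer.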
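A quick way to see that there is no native NumPy data to vectorize over is to inspect the dtype. The snippet below is a small sketch that assumes pandas is installed and uses a two-row toy Series rather than the real business["categories"] column.

```python
import pandas as pd

# Toy stand-in for the real column (assumption).
categories = pd.Series([['a', 'b'], ['c', 'd', 'e']])

# A Series of lists is stored as generic Python objects, not a numeric NumPy dtype.
print(categories.dtype)

# Each cell is an ordinary Python list.
print(type(categories.values[0]))
```

With an object dtype, each operation still has to go through individual Python objects, so there is little for NumPy to speed up.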

Answer 2

Vectorized but slow

You can use NumPy's concatenate:

import numpy as np

list(np.concatenate(categories.values))

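Performance

But we already have lists, i.e. Python objects. So the vectorization has to switch back and forth between Python objects and NumPy data types. This makes things slow: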

categories = pd.Series([['a', 'b'], ['c', 'd', 'e']] * 1000)

%timeit list(np.concatenate(categories.values))
100 loops, best of 3: 7.66 ms per loop

%timeit np.concatenate(categories.values)
100 loops, best of 3: 5.33 ms per loop

%timeit list(chain.from_iterable(categories.values))
1000 loops, best of 3: 231 µs per loop
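The round trip between Python objects and NumPy types can be seen directly in the result. The following sketch assumes NumPy is installed and uses a plain list of lists in place of categories.values; it only illustrates the type conversions and makes no claim about timings.

```python
import numpy as np

# Plain list of lists standing in for categories.values (assumption).
nested = [['a', 'b'], ['c', 'd', 'e']]

# concatenate first converts every Python list into a NumPy string array.
result = np.concatenate(nested)
print(result.dtype.kind)  # 'U' means a NumPy unicode string dtype

# list() keeps NumPy scalar objects; tolist() converts back to plain Python str.
as_numpy_scalars = list(result)
as_python_strings = result.tolist()
print(type(as_numpy_scalars[0]), type(as_python_strings[0]))
```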

Answer 3

You can try business["categories"].str.join(''), but I am guessing that pandas uses Python's string functions. I doubt you can do better than what Python already offers you.

Answer 4

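I used "".join(business["categories"])

It is much faster than business["categories"].str.join(''), but still 4 times slower than the itertools.chain method. I preferred it because it is more readable and no import is required.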
