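I am creating a column of incrementing values and then prepending a string to each value. On large data this is very slow. Please suggest a faster, more efficient way to do this.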
df['New_Column'] = np.arange(len(df)) + 1
df['New_Column'] = 'str_' + df['New_Column'].astype(str)
Input:
id  Field  Value
1   A      1
2   B      0
3   D      1
Expected output:
id  Field  Value  New_Column
1   A      1      str_1
2   B      0      str_2
3   D      1      str_3
Answer 0 (score: 24)
I'll add two more to the mix.
from numpy.core.defchararray import add
df.assign(new=add('str_', np.arange(1, len(df) + 1).astype(str)))
   id Field  Value    new
0   1     A      1  str_1
1   2     B      0  str_2
2   3     D      1  str_3
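The same element-wise concatenation is also reachable through the public `np.char.add` name, which may be preferable because importing from `numpy.core` can emit a deprecation warning on newer NumPy releases. A minimal sketch, assuming a three-row frame that mirrors the question's sample data (the frame construction below is illustrative and not part of the original answer):

```python
import numpy as np
import pandas as pd

# Illustrative frame mirroring the question's sample data (an assumption).
df = pd.DataFrame({'id': [1, 2, 3], 'Field': ['A', 'B', 'D'], 'Value': [1, 0, 1]})

# np.char.add concatenates element-wise; the scalar prefix is broadcast
# against the array of stringified counters.
labels = np.char.add('str_', np.arange(1, len(df) + 1).astype(str))

# assign returns a new frame and leaves df untouched.
out = df.assign(new=labels)
print(out)
```

Note that `assign` returns a copy, so the result has to be bound to a name (or assigned back to `df`) to be kept.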
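f-string in a comprehension
df.assign(new=[f'str_{i}' for i in range(1, len(df) + 1)])
   id Field  Value    new
0   1     A      1  str_1
1   2     B      0  str_2
2   3     D      1  str_3
Comprehension wins on performance relative to simplicity. Note that this is the method cᴏʟᴅsᴘᴇᴇᴅ proposed. I appreciate the upvotes (thank you), but let's give credit where it is due.
Cythonizing the comprehension did not seem to help. Neither did f-strings.
Divakar's numexpr version comes out on top on larger data.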
%load_ext Cython

%%cython
def gen_list(l, h):
    return ['str_%s' % i for i in range(l, h)]

# Candidate implementations. create_inc_pattern and create_inc_pattern_numexpr
# are the functions defined in Divakar's answer.
from timeit import timeit
import numpy as np
import pandas as pd
from numpy.core.defchararray import add

pir1 = lambda d: d.assign(new=[f'str_{i}' for i in range(1, len(d) + 1)])
pir2 = lambda d: d.assign(new=add('str_', np.arange(1, len(d) + 1).astype(str)))
cld1 = lambda d: d.assign(new=['str_%s' % i for i in range(1, len(d) + 1)])
cld2 = lambda d: d.assign(new=gen_list(1, len(d) + 1))
jez1 = lambda d: d.assign(new='str_' + pd.Series(np.arange(1, len(d) + 1), d.index).astype(str))
div1 = lambda d: d.assign(new=create_inc_pattern(prefix_str='str_', start=1, stop=len(d) + 1))
div2 = lambda d: d.assign(new=create_inc_pattern_numexpr(prefix_str='str_', start=1, stop=len(d) + 1))

# One row per frame size (multiples of the sample df), one column per candidate.
res = pd.DataFrame(
    np.nan, [10, 30, 100, 300, 1000, 3000, 10000, 30000],
    'pir1 pir2 cld1 cld2 jez1 div1 div2'.split()
)

for i in res.index:
    d = pd.concat([df] * i)
    for j in res.columns:
        stmt = f'{j}(d)'
        setp = f'from __main__ import {j}, d'
        res.at[i, j] = timeit(stmt, setp, number=200)

res.plot(loglog=True)

# Timings relative to the fastest candidate in each row.
res.div(res.min(1), 0)
           pir1      pir2      cld1      cld2       jez1      div1      div2
10     1.243998  1.137877  1.006501  1.000000   1.798684  1.277133  1.427025
30     1.009771  1.144892  1.012283  1.000000   2.144972  1.210803  1.283230
100    1.090170  1.567300  1.039085  1.000000   3.134154  1.281968  1.356706
300    1.061804  2.260091  1.072633  1.000000   4.792343  1.051886  1.305122
1000   1.135483  3.401408  1.120250  1.033484   7.678876  1.077430  1.000000
3000   1.310274  5.179131  1.359795  1.362273  13.006764  1.317411  1.000000
10000  2.110001  7.861251  1.942805  1.696498  17.905551  1.974627  1.000000
30000  2.188024  8.236724  2.100529  1.872661  18.416222  1.875299  1.000000
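For anyone reading the table above: each row is divided by its own minimum, so the fastest method at a given size shows 1.000000 and the other entries are slowdown factors. A small sketch of that normalization, using made-up timings (the numbers and the column names `method_a` and `method_b` are hypothetical, not measurements):

```python
import pandas as pd

# Made-up timings in seconds (hypothetical values, not measurements).
res = pd.DataFrame(
    {'method_a': [0.2, 0.9], 'method_b': [0.1, 1.8]},
    index=[10, 1000],
)

# res.min(axis=1) is the fastest time per row; dividing along axis=0
# aligns that Series with the row index, so each row is scaled by its own minimum.
relative = res.div(res.min(axis=1), axis=0)
print(relative)
```

With these inputs, `method_b` is the baseline for the first row and `method_a` for the second, which is the same kind of crossover the real table shows between `cld2` and `div2`.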
Answer 1 (score: 16)
When all else fails, use a list comprehension:
df['NewColumn'] = ['str_%s' %i for i in range(1, len(df) + 1)]
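If you cythonize the function, you can speed it up a little further:
%load_ext Cython
%%cython
def gen_list(l, h):
    return ['str_%s' %i for i in range(l, h)]
Note that this code was run on Python 3.6.0 (IPython 6.2.1). Thanks to @hpaulj for improving the solution in the comments.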
# @jezrael's fastest solution
%%timeit
df['NewColumn'] = np.arange(len(df['a'])) + 1
df['NewColumn'] = 'str_' + df['NewColumn'].map(str)
547 ms ± 13.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# in this post - no cython (n is the number of rows in df)
%timeit df['NewColumn'] = ['str_%s'%i for i in range(n)]
409 ms ± 9.36 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# cythonized list comp
%timeit df['NewColumn'] = gen_list(1, len(df) + 1)
370 ms ± 9.23 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
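The `%timeit` magics above only work inside IPython. As a rough sketch of how a similar comparison could be run from a plain script with the standard `timeit` module (the row count `n` and the helper names `list_comp` and `f_string` are assumptions for illustration; the post does not state the size it used):

```python
from timeit import timeit

n = 100_000  # hypothetical row count; the original post does not state n

def list_comp(n):
    # %-formatting inside a list comprehension
    return ['str_%s' % i for i in range(1, n + 1)]

def f_string(n):
    # f-string inside a list comprehension (Python 3.6+)
    return [f'str_{i}' for i in range(1, n + 1)]

# timeit accepts a zero-argument callable; number is how many times it is run.
t_percent = timeit(lambda: list_comp(n), number=5)
t_fstring = timeit(lambda: f_string(n), number=5)
print(f'%-format: {t_percent:.4f}s  f-string: {t_fstring:.4f}s')
```

Absolute numbers will differ by machine and Python version, so only the relative ordering is worth comparing.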
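Answer 2 (score: 14)
After a lot of tinkering with string and numeric dtypes and leveraging the easy interoperability between them, here is what I ended up with to get zero-padded strings, since NumPy handles this well and allows vectorized operations this way: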
def create_inc_pattern(prefix_str, start, stop):
    N = stop - start # count of numbers
    W = int(np.ceil(np.log10(stop+1))) # width of numeral part in string
    padv = np.full(W,48,dtype=np.uint8)
    a0 = np.r_[np.fromstring(prefix_str,dtype='uint8'), padv]
    a1 = np.repeat(a0[None],N,axis=0)
    r = np.arange(start, stop)
    addn = (r[:,None] // 10**np.arange(W-1,-1,-1))%10
    a1[:,len(prefix_str):] += addn.astype(a1.dtype)
    return a1.view('S'+str(a1.shape[1])).ravel()
Wiring in numexpr to speed up the broadcasting + modulus operations:
import numexpr as ne
def create_inc_pattern_numexpr(prefix_str, start, stop):
    N = stop - start # count of numbers
    W = int(np.ceil(np.log10(stop+1))) # width of numeral part in string
    padv = np.full(W,48,dtype=np.uint8)
    a0 = np.r_[np.fromstring(prefix_str,dtype='uint8'), padv]
    a1 = np.repeat(a0[None],N,axis=0)
    r = np.arange(start, stop)
    r2D = r[:,None]
    s = 10**np.arange(W-1,-1,-1)
    addn = ne.evaluate('(r2D/s)%10')
    a1[:,len(prefix_str):] += addn.astype(a1.dtype)
    return a1.view('S'+str(a1.shape[1])).ravel()
So, to use it as the new column:
df['New_Column'] = create_inc_pattern(prefix_str='str_', start=1, stop=len(df)+1)
Sample runs:
In [334]: create_inc_pattern_numexpr(prefix_str='str_', start=1, stop=14)
Out[334]:
array(['str_01', 'str_02', 'str_03', 'str_04', 'str_05', 'str_06',
'str_07', 'str_08', 'str_09', 'str_10', 'str_11', 'str_12', 'str_13'],
dtype='|S6')
In [338]: create_inc_pattern(prefix_str='str_', start=1, stop=124)
Out[338]:
array(['str_001', 'str_002', 'str_003', 'str_004', 'str_005', 'str_006',
'str_007', 'str_008', 'str_009', 'str_010', 'str_011', 'str_012',..
'str_115', 'str_116', 'str_117', 'str_118', 'str_119', 'str_120',
'str_121', 'str_122', 'str_123'],
dtype='|S7')
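Basic idea and explanation with a step-by-step sample run
The basic idea is to create an array of ASCII-equivalent numerals that can be viewed, or converted through a dtype conversion, as a string array. More specifically, we create numerals of uint8 type, so each string is represented by a 1D array of numerals. A list of strings therefore translates to a 2D array in which each row (a 1D array) represents a single string.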
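As a quick illustration of that idea, here is a minimal sketch of the round trip for two hand-written strings (the variable names `codes` and `as_bytes` are only for this example):

```python
import numpy as np

# ASCII codes of two equal-length strings, one string per row.
codes = np.array([list(b'str_15'), list(b'str_16')], dtype=np.uint8)
print(codes)

# Reinterpret each 6-byte row as one fixed-width byte string ('S6').
# view() does not copy; it only changes how the same bytes are read.
as_bytes = codes.view('S6').ravel()
print(as_bytes)
```

Because `view` only reinterprets memory, the cost of producing the strings is essentially the cost of filling the uint8 array.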
1) Inputs:
In [22]: prefix_str='str_'
...: start=15
...: stop=24
2) Parameters:
In [23]: N = stop - start # count of numbers
...: W = int(np.ceil(np.log10(stop+1))) # width of numeral part in string
In [24]: N,W
Out[24]: (9, 2)
3) Create the 1D array of numerals representing the starting string:
In [25]: padv = np.full(W,48,dtype=np.uint8)
...: a0 = np.r_[np.fromstring(prefix_str,dtype='uint8'), padv]
In [27]: a0
Out[27]: array([115, 116, 114, 95, 48, 48], dtype=uint8)
4) Extend it to cover the range of strings as a 2D array:
In [33]: a1 = np.repeat(a0[None],N,axis=0)
...: r = np.arange(start, stop)
...: addn = (r[:,None] // 10**np.arange(W-1,-1,-1))%10
...: a1[:,len(prefix_str):] += addn.astype(a1.dtype)
In [34]: a1
Out[34]:
array([[115, 116, 114, 95, 49, 53],
[115, 116, 114, 95, 49, 54],
[115, 116, 114, 95, 49, 55],
[115, 116, 114, 95, 49, 56],
[115, 116, 114, 95, 49, 57],
[115, 116, 114, 95, 50, 48],
[115, 116, 114, 95, 50, 49],
[115, 116, 114, 95, 50, 50],
[115, 116, 114, 95, 50, 51]], dtype=uint8)
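The digit columns in the array above come from the `addn` term. A small sketch of that step in isolation, reusing the inputs from step 1 (the names `powers`, `digits` and `ascii_digits` are introduced only for this illustration):

```python
import numpy as np

r = np.arange(15, 24)                      # the numbers 15..23
W = 2                                      # digits per number
powers = 10 ** np.arange(W - 1, -1, -1)    # array([10, 1])

# Broadcasting a (9, 1) column against a (2,) row gives a (9, 2) array:
# integer-divide by each power of ten, then keep the last decimal digit.
digits = (r[:, None] // powers) % 10

# Adding 48, the ASCII code of '0', turns each digit into its character code.
ascii_digits = digits + 48
print(ascii_digits)
```

The last two columns of `a1` in the session above are exactly these values.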
5) Thus, each row holds the ASCII equivalent of one string of the desired output. Let's get it with the final step:
In [35]: a1.view('S'+str(a1.shape[1])).ravel()
Out[35]:
array(['str_15', 'str_16', 'str_17', 'str_18', 'str_19', 'str_20',
'str_21', 'str_22', 'str_23'],
dtype='|S6')
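Here is a quick timing test against the list comprehension version, which looked like the best performer judging by the timings in the other posts: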
In [339]: N = 10000
In [340]: %timeit ['str_%s'%i for i in range(N)]
1000 loops, best of 3: 1.12 ms per loop
In [341]: %timeit create_inc_pattern_numexpr(prefix_str='str_', start=1, stop=N)
1000 loops, best of 3: 490 µs per loop
In [342]: N = 100000
In [343]: %timeit ['str_%s'%i for i in range(N)]
100 loops, best of 3: 14 ms per loop
In [344]: %timeit create_inc_pattern_numexpr(prefix_str='str_', start=1, stop=N)
100 loops, best of 3: 4 ms per loop
On Python 3, to get a string dtype array, we need to pad the intermediate int dtype array with a few more zeros. So the equivalents without and with numexpr for Python 3 end up being something along these lines:
Approach #1 (no numexpr):
def create_inc_pattern(prefix_str, start, stop):
    N = stop - start # count of numbers
    W = int(np.ceil(np.log10(stop+1))) # width of numeral part in string
    dl = len(prefix_str)+W # datatype length
    dt = np.uint8 # int datatype for string to-from conversion
    padv = np.full(W,48,dtype=np.uint8)
    # np.frombuffer on the encoded prefix replaces the deprecated binary mode of np.fromstring
    a0 = np.r_[np.frombuffer(prefix_str.encode('ascii'),dtype='uint8'), padv]
    r = np.arange(start, stop)
    addn = (r[:,None] // 10**np.arange(W-1,-1,-1))%10
    a1 = np.repeat(a0[None],N,axis=0)
    a1[:,len(prefix_str):] += addn.astype(dt)
    a1.shape = (-1)
    a2 = np.zeros((len(a1),4),dtype=dt) # 4 bytes per character for the 'U' dtype
    a2[:,0] = a1
    return np.frombuffer(a2.ravel(), dtype='U'+str(dl))
Approach #2 (using numexpr):
import numexpr as ne
def create_inc_pattern_numexpr(prefix_str, start, stop):
    N = stop - start # count of numbers
    W = int(np.ceil(np.log10(stop+1))) # width of numeral part in string
    dl = len(prefix_str)+W # datatype length
    dt = np.uint8 # int datatype for string to-from conversion
    padv = np.full(W,48,dtype=np.uint8)
    # np.frombuffer on the encoded prefix replaces the deprecated binary mode of np.fromstring
    a0 = np.r_[np.frombuffer(prefix_str.encode('ascii'),dtype='uint8'), padv]
    r = np.arange(start, stop)
    r2D = r[:,None]
    s = 10**np.arange(W-1,-1,-1)
    addn = ne.evaluate('(r2D/s)%10')
    a1 = np.repeat(a0[None],N,axis=0)
    a1[:,len(prefix_str):] += addn.astype(dt)
    a1.shape = (-1)
    a2 = np.zeros((len(a1),4),dtype=dt) # 4 bytes per character for the 'U' dtype
    a2[:,0] = a1
    return np.frombuffer(a2.ravel(), dtype='U'+str(dl))
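The extra zero padding in the two Python 3 functions above exists because NumPy's unicode (`'U'`) dtype stores every character in 4 bytes. A minimal sketch of that layout for a single string, assuming little-endian storage, which is made explicit here with the `'<U6'` dtype (the names `codes`, `padded` and `text` are only for this example):

```python
import numpy as np

codes = np.frombuffer(b'str_15', dtype=np.uint8)   # 6 ASCII codes

# One UCS-4 character occupies 4 bytes; for ASCII text only the first
# (lowest) byte of each little-endian character is non-zero.
padded = np.zeros((len(codes), 4), dtype=np.uint8)
padded[:, 0] = codes

# Reinterpret the 24 bytes as a single 6-character little-endian string.
text = padded.ravel().view('<U6')
print(text)

# Going the other way shows the same layout.
print(np.array(['a'], dtype='<U1').view(np.uint8))
```

The functions above use the native-order `'U'` dtype instead, so writing the code into byte 0 of each group relies on the machine being little-endian.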
Timings:
In [8]: N = 100000
In [9]: %timeit ['str_%s'%i for i in range(N)]
100 loops, best of 3: 18.5 ms per loop
In [10]: %timeit create_inc_pattern_numexpr(prefix_str='str_', start=1, stop=N)
100 loops, best of 3: 6.06 ms per loop
Answer 3 (score: 4)
One possible solution is to convert the values to strings with map:
df['New_Column'] = np.arange(len(df)) + 1
df['New_Column'] = 'str_' + df['New_Column'].map(str)
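For reference, a small self-contained sketch of this approach next to the `astype(str)` variant from the question, assuming a frame built to match the question's sample data (the frame and the names `counter`, `via_map` and `via_astype` are illustrative):

```python
import numpy as np
import pandas as pd

# Illustrative frame mirroring the question's sample data (an assumption).
df = pd.DataFrame({'id': [1, 2, 3], 'Field': ['A', 'B', 'D'], 'Value': [1, 0, 1]})

# 1-based counter that shares the frame's index.
counter = pd.Series(np.arange(1, len(df) + 1), index=df.index)

# map(str) calls str() on each element; astype(str) casts the whole Series.
via_map = 'str_' + counter.map(str)
via_astype = 'str_' + counter.astype(str)

df['New_Column'] = via_map
print(df)
```

Both variants produce the same labels; according to the timings elsewhere in this thread, either one is slower than a plain list comprehension on large frames.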