Efficiently create a new column with incremental values

Question

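I am creating a column of incremental values and then prepending a string to each value. On large data this is very slow. Please suggest a faster, more efficient way to do this.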

df['New_Column'] = np.arange(len(df)) + 1
df['New_Column'] = 'str_' + df['New_Column'].astype(str)

Input

id  Field   Value
1     A       1
2     B       0     
3     D       1

Output

id  Field   Value   New_Column
1     A       1     str_1
2     B       0     str_2
3     D       1     str_3

Answer 1

I'll add two more to the mix.

NumPy

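from numpy.core.defchararray import add  # also exposed as np.char.add

df.assign(new=add('str_', np.arange(1, len(df) + 1).astype(str)))

   id Field  Value    new
0   1     A      1  str_1
1   2     B      0  str_2
2   3     D      1  str_3

f-string in a comprehension

Python 3.6+

df.assign(new=[f'str_{i}' for i in range(1, len(df) + 1)])

   id Field  Value    new
0   1     A      1  str_1
1   2     B      0  str_2
2   3     D      1  str_3

Time test

Conclusions

The comprehension wins on performance relative to simplicity. Note that this is the method cᴏʟᴅsᴘᴇᴇᴅ proposed. I appreciate the upvotes (thank you), but let's give credit where it is due.

Cythonizing the comprehension did not seem to help, and neither did f-strings. Divakar's numexpr approach comes out on top for performance on larger data.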
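As a quick sanity check that the NumPy route and the comprehension produce the same column, here is a minimal self-contained sketch. The sample frame is rebuilt from the question's input, and `np.char.add` is used as the public alias of the `add` function imported above; the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

# Sample frame mirroring the question's input.
df = pd.DataFrame({"id": [1, 2, 3], "Field": ["A", "B", "D"], "Value": [1, 0, 1]})
n = len(df)

# Vectorised NumPy concatenation of the prefix with the stringified counter.
via_numpy = np.char.add("str_", np.arange(1, n + 1).astype(str))

# Plain list comprehension with an f-string (Python 3.6+).
via_comprehension = [f"str_{i}" for i in range(1, n + 1)]

# Either result can be attached as the new column.
out = df.assign(new=via_comprehension)
print(out)
```

Both variants build the whole column in one go, so the choice between them is mostly about speed at a given size and readability.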

Functions

%load_ext Cython

%%cython
def gen_list(l, h):
    return ['str_%s' % i for i in range(l, h)]

Tests

pir1 = lambda d: d.assign(new=[f'str_{i}' for i in range(1, len(d) + 1)])
pir2 = lambda d: d.assign(new=add('str_', np.arange(1, len(d) + 1).astype(str)))
cld1 = lambda d: d.assign(new=['str_%s' % i for i in range(1, len(d) + 1)])
cld2 = lambda d: d.assign(new=gen_list(1, len(d) + 1))
jez1 = lambda d: d.assign(new='str_' + pd.Series(np.arange(1, len(d) + 1), d.index).astype(str))
div1 = lambda d: d.assign(new=create_inc_pattern(prefix_str='str_', start=1, stop=len(d) + 1))
div2 = lambda d: d.assign(new=create_inc_pattern_numexpr(prefix_str='str_', start=1, stop=len(d) + 1))

Results

from timeit import timeit

res = pd.DataFrame(
    np.nan, [10, 30, 100, 300, 1000, 3000, 10000, 30000],
    'pir1 pir2 cld1 cld2 jez1 div1 div2'.split()
)

for i in res.index:
    d = pd.concat([df] * i)
    for j in res.columns:
        stmt = f'{j}(d)'
        setp = f'from __main__ import {j}, d'
        res.at[i, j] = timeit(stmt, setp, number=200)

res.plot(loglog=True)
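The loop above assumes an interactive session in which `df` and every candidate function already exist. For readers who want to try the idea outside IPython, the following is a rough, self-contained sketch of the same kind of harness. It covers only two of the candidates, and the sizes and repeat count are arbitrary small values chosen for illustration, so the numbers it prints are not comparable to the results reported here.

```python
from timeit import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3], "Field": ["A", "B", "D"], "Value": [1, 0, 1]})

# Two of the candidates from the test list, written as named functions.
def comprehension(d):
    return d.assign(new=[f"str_{i}" for i in range(1, len(d) + 1)])

def series_astype(d):
    return d.assign(new="str_" + pd.Series(np.arange(1, len(d) + 1), d.index).astype(str))

candidates = {"comprehension": comprehension, "series_astype": series_astype}
sizes = [10, 100, 1000]  # repetitions of the 3-row frame (illustrative, kept small)

timings = pd.DataFrame(np.nan, index=sizes, columns=list(candidates))
for size in sizes:
    d = pd.concat([df] * size, ignore_index=True)
    for name, func in candidates.items():
        # Passing a callable avoids the `from __main__ import ...` setup string.
        timings.at[size, name] = timeit(lambda: func(d), number=20)

print(timings)
```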

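Proposed approach

After tinkering quite a bit with string and numeric dtypes and leveraging the easy interoperability between them, here is what I ended up with to get zero-padded strings, since NumPy handles this well and allows vectorized operations this way: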

def create_inc_pattern(prefix_str, start, stop):
    N = stop - start # count of numbers
    W = int(np.ceil(np.log10(stop+1))) # width of numeral part in string

    padv = np.full(W,48,dtype=np.uint8)
    a0 = np.r_[np.fromstring(prefix_str,dtype='uint8'), padv]
    a1 = np.repeat(a0[None],N,axis=0)

    r = np.arange(start, stop)
    addn = (r[:,None] // 10**np.arange(W-1,-1,-1))%10
    a1[:,len(prefix_str):] += addn.astype(a1.dtype)
    return a1.view('S'+str(a1.shape[1])).ravel()

Bringing in numexpr to speed up the broadcasting + modulus operations:

import numexpr as ne

def create_inc_pattern_numexpr(prefix_str, start, stop):
    N = stop - start # count of numbers
    W = int(np.ceil(np.log10(stop+1))) # width of numeral part in string

    padv = np.full(W,48,dtype=np.uint8)
    a0 = np.r_[np.fromstring(prefix_str,dtype='uint8'), padv]
    a1 = np.repeat(a0[None],N,axis=0)

    r = np.arange(start, stop)
    r2D = r[:,None]
    s = 10**np.arange(W-1,-1,-1)
    addn = ne.evaluate('(r2D/s)%10')
    a1[:,len(prefix_str):] += addn.astype(a1.dtype)
    return a1.view('S'+str(a1.shape[1])).ravel()

So, to use it as the new column:

df['New_Column'] = create_inc_pattern(prefix_str='str_', start=1, stop=len(df)+1)

Sample runs:

In [334]: create_inc_pattern_numexpr(prefix_str='str_', start=1, stop=14)
Out[334]: 
array(['str_01', 'str_02', 'str_03', 'str_04', 'str_05', 'str_06',
       'str_07', 'str_08', 'str_09', 'str_10', 'str_11', 'str_12', 'str_13'], 
      dtype='|S6')

In [338]: create_inc_pattern(prefix_str='str_', start=1, stop=124)
Out[338]: 
array(['str_001', 'str_002', 'str_003', 'str_004', 'str_005', 'str_006',
       'str_007', 'str_008', 'str_009', 'str_010', 'str_011', 'str_012',..
       'str_115', 'str_116', 'str_117', 'str_118', 'str_119', 'str_120',
       'str_121', 'str_122', 'str_123'], 
      dtype='|S7')

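Explanation

Basic idea and explanation with a step-by-step sample run

The basic idea is to create an array of ASCII-equivalent numbers that can be viewed as, or converted to, a string array through a dtype conversion. More specifically, we create numerals of uint8 type, so each string is represented by a 1D array of numerals. A list of strings then translates to a 2D array in which each row (a 1D array) represents a single string.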
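The viewing trick described here can be tried in isolation. The snippet below is a small sketch, separate from the functions above, that reinterprets uint8 ASCII codes as fixed-width byte strings; the array names are made up for the illustration.

```python
import numpy as np

# ASCII codes for the characters of "str_15", one byte per character.
codes = np.array([115, 116, 114, 95, 49, 53], dtype=np.uint8)

# Reinterpret the 6 bytes as a single fixed-width byte string (dtype 'S6').
as_bytes = codes.view("S6")
print(as_bytes)

# A 2D array with one row per string yields one byte string per row.
rows = np.array([[115, 116, 114, 95, 49, 53],
                 [115, 116, 114, 95, 49, 54]], dtype=np.uint8)
strings = rows.view("S6").ravel()
print(strings)
```

No data is copied by `view`; the same bytes are simply read with a different dtype, which is why the approach is cheap.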

1) Inputs:

In [22]: prefix_str='str_'
    ...: start=15
    ...: stop=24

2) Parameters:

In [23]: N = stop - start # count of numbers
    ...: W = int(np.ceil(np.log10(stop+1))) # width of numeral part in string

In [24]: N,W
Out[24]: (9, 2)

3) Create a 1D array of numerals representing the starting string:

In [25]: padv = np.full(W,48,dtype=np.uint8)
    ...: a0 = np.r_[np.fromstring(prefix_str,dtype='uint8'), padv]

In [27]: a0
Out[27]: array([115, 116, 114,  95,  48,  48], dtype=uint8)

4) Extend it to cover the range of strings as a 2D array:

In [33]: a1 = np.repeat(a0[None],N,axis=0)
    ...: r = np.arange(start, stop)
    ...: addn = (r[:,None] // 10**np.arange(W-1,-1,-1))%10
    ...: a1[:,len(prefix_str):] += addn.astype(a1.dtype)

In [34]: a1
Out[34]: 
array([[115, 116, 114,  95,  49,  53],
       [115, 116, 114,  95,  49,  54],
       [115, 116, 114,  95,  49,  55],
       [115, 116, 114,  95,  49,  56],
       [115, 116, 114,  95,  49,  57],
       [115, 116, 114,  95,  50,  48],
       [115, 116, 114,  95,  50,  49],
       [115, 116, 114,  95,  50,  50],
       [115, 116, 114,  95,  50,  51]], dtype=uint8)

5) Thus, each row holds the ASCII equivalent of one string of the desired output. Let's get it with the final step:

In [35]: a1.view('S'+str(a1.shape[1])).ravel()
Out[35]: 
array(['str_15', 'str_16', 'str_17', 'str_18', 'str_19', 'str_20',
       'str_21', 'str_22', 'str_23'], 
      dtype='|S6')
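The digit extraction used in step 4 is the least obvious part of the walkthrough, so here it is on its own as a short sketch with the same inputs (15 to 23, two digit columns). Dividing by each power of ten and taking the remainder modulo 10 isolates one decimal digit per column, and adding 48 (ASCII '0') turns the digits into character codes.

```python
import numpy as np

r = np.arange(15, 24)   # the numbers 15..23, as in the walkthrough
W = 2                   # number of digit columns

# Powers of ten from most to least significant digit.
powers = 10 ** np.arange(W - 1, -1, -1)

# Broadcasting (9, 1) against (2,) gives one row of digits per number.
digits = (r[:, None] // powers) % 10

# Shift digits into the ASCII range for '0'..'9'.
ascii_digits = digits + 48
print(ascii_digits)
```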

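Timings

Here is a quick timing test against the list-comprehension version, which appeared to be the best performer judging by the timings in the other posts: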

In [339]: N = 10000

In [340]: %timeit ['str_%s'%i for i in range(N)]
1000 loops, best of 3: 1.12 ms per loop

In [341]: %timeit create_inc_pattern_numexpr(prefix_str='str_', start=1, stop=N)
1000 loops, best of 3: 490 µs per loop

In [342]: N = 100000

In [343]: %timeit ['str_%s'%i for i in range(N)]
100 loops, best of 3: 14 ms per loop

In [344]: %timeit create_inc_pattern_numexpr(prefix_str='str_', start=1, stop=N)
100 loops, best of 3: 4 ms per loop

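Python 3 code

On Python 3, to get a string dtype array we need to pad the intermediate int dtype array with a few more zeros, because NumPy's unicode (U) dtype stores each character in 4 bytes. The equivalents without and with numexpr for Python 3 end up along these lines:

Approach #1 (no numexpr):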

def create_inc_pattern(prefix_str, start, stop):
    N = stop - start # count of numbers
    W = int(np.ceil(np.log10(stop+1))) # width of numeral part in string
    dl = len(prefix_str)+W # datatype length
    dt = np.uint8 # int datatype for string to-from conversion 

    padv = np.full(W,48,dtype=np.uint8)
    # np.fromstring is deprecated for binary input; read the encoded bytes instead
    a0 = np.r_[np.frombuffer(prefix_str.encode('ascii'), dtype=np.uint8), padv]

    r = np.arange(start, stop)

    addn = (r[:,None] // 10**np.arange(W-1,-1,-1))%10
    a1 = np.repeat(a0[None],N,axis=0)
    a1[:,len(prefix_str):] += addn.astype(dt)
    a1.shape = (-1)

    a2 = np.zeros((len(a1),4),dtype=dt)
    a2[:,0] = a1
    return np.frombuffer(a2.ravel(), dtype='U'+str(dl))

Approach #2 (with numexpr):

import numexpr as ne

def create_inc_pattern_numexpr(prefix_str, start, stop):
    N = stop - start # count of numbers
    W = int(np.ceil(np.log10(stop+1))) # width of numeral part in string
    dl = len(prefix_str)+W # datatype length
    dt = np.uint8 # int datatype for string to-from conversion 

    padv = np.full(W,48,dtype=np.uint8)
    # np.fromstring is deprecated for binary input; read the encoded bytes instead
    a0 = np.r_[np.frombuffer(prefix_str.encode('ascii'), dtype=np.uint8), padv]

    r = np.arange(start, stop)

    r2D = r[:,None]
    s = 10**np.arange(W-1,-1,-1)
    addn = ne.evaluate('(r2D/s)%10')
    a1 = np.repeat(a0[None],N,axis=0)
    a1[:,len(prefix_str):] += addn.astype(dt)
    a1.shape = (-1)

    a2 = np.zeros((len(a1),4),dtype=dt)
    a2[:,0] = a1
    return np.frombuffer(a2.ravel(), dtype='U'+str(dl))

Timings:

In [8]: N = 100000

In [9]: %timeit ['str_%s'%i for i in range(N)]
100 loops, best of 3: 18.5 ms per loop

In [10]: %timeit create_inc_pattern_numexpr(prefix_str='str_', start=1, stop=N)
100 loops, best of 3: 6.06 ms per loop
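The extra zero padding in the Python 3 versions can be illustrated on a single string. This is a minimal sketch, assuming a little-endian layout (spelled out with the `<U6` dtype), in which each ASCII code sits in the first of four bytes and the remaining three are zero.

```python
import numpy as np

# Six ASCII codes for "str_15".
codes = np.frombuffer(b"str_15", dtype=np.uint8)

# Four bytes per character: the code in byte 0, zeros in bytes 1..3.
padded = np.zeros((len(codes), 4), dtype=np.uint8)
padded[:, 0] = codes

# Read the 24 bytes back as one 6-character little-endian unicode string.
text = np.frombuffer(padded.tobytes(), dtype="<U6")
print(text)
```

This is also why the Python 3 functions allocate an array four times the size of the byte version before converting.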

Answer 4

One possible solution is to convert the values to strings with map:

df['New_Column'] = np.arange(len(df)) + 1
df['New_Column'] = 'str_' + df['New_Column'].map(str)
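A self-contained sketch of this variant, using a frame rebuilt from the question's input, shows that `map(str)` yields the same labels as `astype(str)`. Which of the two is faster depends on the pandas version and data size, so it is worth timing both on your own data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3], "Field": ["A", "B", "D"], "Value": [1, 0, 1]})

# Incremental counter 1..n aligned with the frame's index.
counter = pd.Series(np.arange(len(df)) + 1, index=df.index)

# Element-wise str() via map, compared with the astype(str) conversion.
with_map = "str_" + counter.map(str)
with_astype = "str_" + counter.astype(str)

df["New_Column"] = with_map
print(df)
```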
