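Efficiently creating a new column with incremental values

Time: 2018-03-24 20:24:52

Tags: python performance pandas numpy

I am creating a column of incremental values and then prepending a string to each value. On large data this is very slow. Please suggest a faster, more efficient way to do this.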

df['New_Column'] = np.arange(df.shape[0]) + 1
df['New_Column'] = 'str_' + df['New_Column'].astype(str)

Input

id  Field   Value
1     A       1
2     B       0     
3     D       1

Output

id  Field   Value   New_Column
1     A       1     str_1
2     B       0     str_2
3     D       1     str_3

4 Answers:

Answer 0: (score: 24)

I'll add two more to the mix

Numpy

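from numpy.core.defchararray import add

df.assign(new=add('str_', np.arange(1, len(df) + 1).astype(str)))

   id Field  Value    new
0   1     A      1  str_1
1   2     B      0  str_2
2   3     D      1  str_3

Comprehension (f-strings, Python 3.6+)

df.assign(new=[f'str_{i}' for i in range(1, len(df) + 1)])

   id Field  Value    new
0   1     A      1  str_1
1   2     B      0  str_2
2   3     D      1  str_3

Time Test

Conclusions

The comprehension wins the day on performance relative to its simplicity. Mind you, this was cᴏʟᴅsᴘᴇᴇᴅ's proposed method. I appreciate the upvotes (thank you), but let's give credit where it's due.

Cythonizing the comprehension didn't seem to help, and neither did f-strings. Divakar's numexpr version comes out on top on larger data.

Functions
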
%load_ext Cython
%%cython
def gen_list(l, h):
    return ['str_%s' % i for i in range(l, h)]

Tests

pir1 = lambda d: d.assign(new=[f'str_{i}' for i in range(1, len(d) + 1)])
pir2 = lambda d: d.assign(new=add('str_', np.arange(1, len(d) + 1).astype(str)))
cld1 = lambda d: d.assign(new=['str_%s' % i for i in range(1, len(d) + 1)])
cld2 = lambda d: d.assign(new=gen_list(1, len(d) + 1))
jez1 = lambda d: d.assign(new='str_' + pd.Series(np.arange(1, len(d) + 1), d.index).astype(str))
div1 = lambda d: d.assign(new=create_inc_pattern(prefix_str='str_', start=1, stop=len(d) + 1))
div2 = lambda d: d.assign(new=create_inc_pattern_numexpr(prefix_str='str_', start=1, stop=len(d) + 1))

Results

from timeit import timeit

res = pd.DataFrame(
    np.nan, [10, 30, 100, 300, 1000, 3000, 10000, 30000],
    'pir1 pir2 cld1 cld2 jez1 div1 div2'.split()
)

for i in res.index:
    d = pd.concat([df] * i)
    for j in res.columns:
        stmt = f'{j}(d)'
        setp = f'from __main__ import {j}, d'
        res.at[i, j] = timeit(stmt, setp, number=200)


res.plot(loglog=True)

More functions

res.div(res.min(1), 0)

           pir1      pir2      cld1      cld2       jez1      div1      div2
10     1.243998  1.137877  1.006501  1.000000   1.798684  1.277133  1.427025
30     1.009771  1.144892  1.012283  1.000000   2.144972  1.210803  1.283230
100    1.090170  1.567300  1.039085  1.000000   3.134154  1.281968  1.356706
300    1.061804  2.260091  1.072633  1.000000   4.792343  1.051886  1.305122
1000   1.135483  3.401408  1.120250  1.033484   7.678876  1.077430  1.000000
3000   1.310274  5.179131  1.359795  1.362273  13.006764  1.317411  1.000000
10000  2.110001  7.861251  1.942805  1.696498  17.905551  1.974627  1.000000
30000  2.188024  8.236724  2.100529  1.872661  18.416222  1.875299  1.000000

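Answer 1: (score: 16)

When all else fails, use a list comprehension:

df['NewColumn'] = ['str_%s' %i for i in range(1, len(df) + 1)]

If you wrap the comprehension in a function and cythonize it, you can speed things up a little further: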

%load_ext Cython

%%cython
def gen_list(l, h):
    return ['str_%s' %i for i in range(l, h)]

Note: this code was run on Python 3.6.0 (IPython 6.2.1). Thanks to @hpaulj, whose comment improved the solution.

# @jezrael's fastest solution

%%timeit
df['NewColumn'] = np.arange(len(df['a'])) + 1
df['NewColumn'] = 'str_' + df['NewColumn'].map(str)

547 ms ± 13.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# in this post - no cython

%timeit df['NewColumn'] = ['str_%s'%i for i in range(n)]
409 ms ± 9.36 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# cythonized list comp 

%timeit df['NewColumn'] = gen_list(1, len(df) + 1)
370 ms ± 9.23 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
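
For reference, here is the plain list comprehension from this answer applied to the question's sample data. It is only a minimal sketch: the frame below is rebuilt by hand from the question's input table, so its construction is an assumption rather than the asker's actual data.

```python
import pandas as pd

# Sample frame rebuilt from the question's input table (assumed construction).
df = pd.DataFrame({"id": [1, 2, 3], "Field": ["A", "B", "D"], "Value": [1, 0, 1]})

# Build all labels in plain Python, then assign the finished list in one step.
df["NewColumn"] = ["str_%s" % i for i in range(1, len(df) + 1)]

print(df)
```

Building the list first and assigning it once avoids going through a pandas string operation for every element, which is likely why it does well in the timings above.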

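Answer 2: (score: 14)

Proposed approach

After tinkering quite a bit with the string and numeric dtypes and leveraging the easy interoperability between them, here is what I ended up with to get zero-padded strings, since NumPy handles this well and allows vectorized operations this way: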

def create_inc_pattern(prefix_str, start, stop):
    N = stop - start # count of numbers
    W = int(np.ceil(np.log10(stop+1))) # width of numeral part in string

    padv = np.full(W,48,dtype=np.uint8)
    a0 = np.r_[np.fromstring(prefix_str,dtype='uint8'), padv]
    a1 = np.repeat(a0[None],N,axis=0)

    r = np.arange(start, stop)
    addn = (r[:,None] // 10**np.arange(W-1,-1,-1))%10
    a1[:,len(prefix_str):] += addn.astype(a1.dtype)
    return a1.view('S'+str(a1.shape[1])).ravel()

Wiring in numexpr to speed up the broadcasting + modulus operations:

import numexpr as ne

def create_inc_pattern_numexpr(prefix_str, start, stop):
    N = stop - start # count of numbers
    W = int(np.ceil(np.log10(stop+1))) # width of numeral part in string

    padv = np.full(W,48,dtype=np.uint8)
    a0 = np.r_[np.fromstring(prefix_str,dtype='uint8'), padv]
    a1 = np.repeat(a0[None],N,axis=0)

    r = np.arange(start, stop)
    r2D = r[:,None]
    s = 10**np.arange(W-1,-1,-1)
    addn = ne.evaluate('(r2D/s)%10')
    a1[:,len(prefix_str):] += addn.astype(a1.dtype)
    return a1.view('S'+str(a1.shape[1])).ravel()

Thus, to use it as the new column:

df['New_Column'] = create_inc_pattern(prefix_str='str_', start=1, stop=len(df)+1)

Sample runs:

In [334]: create_inc_pattern_numexpr(prefix_str='str_', start=1, stop=14)
Out[334]: 
array(['str_01', 'str_02', 'str_03', 'str_04', 'str_05', 'str_06',
       'str_07', 'str_08', 'str_09', 'str_10', 'str_11', 'str_12', 'str_13'], 
      dtype='|S6')

In [338]: create_inc_pattern(prefix_str='str_', start=1, stop=124)
Out[338]: 
array(['str_001', 'str_002', 'str_003', 'str_004', 'str_005', 'str_006',
       'str_007', 'str_008', 'str_009', 'str_010', 'str_011', 'str_012',..
       'str_115', 'str_116', 'str_117', 'str_118', 'str_119', 'str_120',
       'str_121', 'str_122', 'str_123'], 
      dtype='|S7')

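Explanation

Basic idea and explanation with a step-by-step sample run

The basic idea is to create arrays of ASCII-equivalent numerals, which can be viewed as, or converted to, strings through a dtype conversion. More specifically, we create numerals of type uint8. Each string is therefore represented by a 1D array of numerals. A list of strings becomes a 2D array in which each row (a 1D array) represents a single string.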
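
The idea of viewing numerals as strings can be seen in isolation with a tiny hand-built array. This is only an illustrative sketch; the names `codes` and `as_bytes` are made up for the example and do not appear in the answer's functions.

```python
import numpy as np

# Each row holds the ASCII codes of one 6-character string.
codes = np.array(
    [[115, 116, 114, 95, 48, 49],   # s t r _ 0 1
     [115, 116, 114, 95, 48, 50]],  # s t r _ 0 2
    dtype=np.uint8,
)

# Reinterpret each 6-byte row as one fixed-width byte string; no data is copied.
as_bytes = codes.view("S6").ravel()
print(as_bytes)

# Converting the byte strings to str decodes them to ordinary text.
print(as_bytes.astype(str))
```

Because `view` only reinterprets the existing buffer, the cost of producing the strings is essentially the cost of filling in the numeric array.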

1) Inputs:

In [22]: prefix_str='str_'
    ...: start=15
    ...: stop=24

2) Parameters:

In [23]: N = stop - start # count of numbers
    ...: W = int(np.ceil(np.log10(stop+1))) # width of numeral part in string

In [24]: N,W
Out[24]: (9, 2)

3) Create a 1D array of numerals representing the starting string:

In [25]: padv = np.full(W,48,dtype=np.uint8)
    ...: a0 = np.r_[np.fromstring(prefix_str,dtype='uint8'), padv]

In [27]: a0
Out[27]: array([115, 116, 114,  95,  48,  48], dtype=uint8)

4) Extend it to cover the range of strings as a 2D array:

In [33]: a1 = np.repeat(a0[None],N,axis=0)
    ...: r = np.arange(start, stop)
    ...: addn = (r[:,None] // 10**np.arange(W-1,-1,-1))%10
    ...: a1[:,len(prefix_str):] += addn.astype(a1.dtype)

In [34]: a1
Out[34]: 
array([[115, 116, 114,  95,  49,  53],
       [115, 116, 114,  95,  49,  54],
       [115, 116, 114,  95,  49,  55],
       [115, 116, 114,  95,  49,  56],
       [115, 116, 114,  95,  49,  57],
       [115, 116, 114,  95,  50,  48],
       [115, 116, 114,  95,  50,  49],
       [115, 116, 114,  95,  50,  50],
       [115, 116, 114,  95,  50,  51]], dtype=uint8)

5) Thus, each row holds the ASCII equivalents of one string of the desired output. Let's get it with the final step:

In [35]: a1.view('S'+str(a1.shape[1])).ravel()
Out[35]: 
array(['str_15', 'str_16', 'str_17', 'str_18', 'str_19', 'str_20',
       'str_21', 'str_22', 'str_23'], 
      dtype='|S6')
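
The digit-extraction expression used in step 4 may be easier to follow on its own. The sketch below reuses the walkthrough's values (start=15, stop=24, W=2); the intermediate names `powers`, `digits` and `digit_codes` are introduced here purely for illustration.

```python
import numpy as np

r = np.arange(15, 24)                      # numbers to render: 15..23
W = 2                                      # fixed number of digits per number
powers = 10 ** np.arange(W - 1, -1, -1)    # [10, 1]

# Broadcasting a (9, 1) column against a (2,) row yields a (9, 2) digit table:
# integer-divide by each power of ten, then keep the last decimal digit.
digits = (r[:, None] // powers) % 10
print(digits[:3])   # [[1 5] [1 6] [1 7]]

# Adding 48 (the ASCII code of '0') turns each digit into its character code.
digit_codes = (digits + 48).astype(np.uint8)
print(digit_codes[-1])   # [50 51], i.e. '2' and '3'
```

In the answer's function the same addition happens in place on a block pre-filled with 48, which is what produces the zero padding.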

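Timings

Here is a quick timing test against the list-comprehension version, which appeared to be the best performer going by the timings in the other posts: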

In [339]: N = 10000

In [340]: %timeit ['str_%s'%i for i in range(N)]
1000 loops, best of 3: 1.12 ms per loop

In [341]: %timeit create_inc_pattern_numexpr(prefix_str='str_', start=1, stop=N)
1000 loops, best of 3: 490 µs per loop

In [342]: N = 100000

In [343]: %timeit ['str_%s'%i for i in range(N)]
100 loops, best of 3: 14 ms per loop

In [344]: %timeit create_inc_pattern_numexpr(prefix_str='str_', start=1, stop=N)
100 loops, best of 3: 4 ms per loop

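Python 3 code

On Python 3, to get a string dtype array, we need to pad a few more zeros onto the intermediate int dtype array. So the Python 3 equivalents of the versions without and with numexpr end up along these lines:

Approach #1 (no numexpr):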

def create_inc_pattern(prefix_str, start, stop):
    N = stop - start # count of numbers
    W = int(np.ceil(np.log10(stop+1))) # width of numeral part in string
    dl = len(prefix_str)+W # datatype length
    dt = np.uint8 # int datatype for string to-from conversion 

    padv = np.full(W,48,dtype=np.uint8)
    a0 = np.r_[np.frombuffer(prefix_str.encode('ascii'), dtype='uint8'), padv]

    r = np.arange(start, stop)

    addn = (r[:,None] // 10**np.arange(W-1,-1,-1))%10
    a1 = np.repeat(a0[None],N,axis=0)
    a1[:,len(prefix_str):] += addn.astype(dt)
    a1.shape = (-1)

    a2 = np.zeros((len(a1),4),dtype=dt)
    a2[:,0] = a1
    return np.frombuffer(a2.ravel(), dtype='U'+str(dl))

Approach #2 (with numexpr):

import numexpr as ne

def create_inc_pattern_numexpr(prefix_str, start, stop):
    N = stop - start # count of numbers
    W = int(np.ceil(np.log10(stop+1))) # width of numeral part in string
    dl = len(prefix_str)+W # datatype length
    dt = np.uint8 # int datatype for string to-from conversion 

    padv = np.full(W,48,dtype=np.uint8)
    a0 = np.r_[np.frombuffer(prefix_str.encode('ascii'), dtype='uint8'), padv]

    r = np.arange(start, stop)

    r2D = r[:,None]
    s = 10**np.arange(W-1,-1,-1)
    addn = ne.evaluate('(r2D/s)%10')
    a1 = np.repeat(a0[None],N,axis=0)
    a1[:,len(prefix_str):] += addn.astype(dt)
    a1.shape = (-1)

    a2 = np.zeros((len(a1),4),dtype=dt)
    a2[:,0] = a1
    return np.frombuffer(a2.ravel(), dtype='U'+str(dl))
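
The extra zero padding in the Python 3 versions exists because NumPy's unicode dtype stores four bytes per character. The sketch below shows that step alone. Note that it is a variation rather than the answer's exact code: it widens to uint32 instead of filling column 0 of an (n, 4) uint8 array, which gives the same bytes on a little-endian machine; the names `ascii_codes`, `wide` and `text` are illustrative.

```python
import numpy as np

# Six ASCII codes, one byte each.
ascii_codes = np.frombuffer(b"str_07", dtype=np.uint8)

# The 'U' dtype uses 4 bytes per character, so widen every code to 4 bytes.
# (Swapped-in technique: uint32 in native byte order, instead of the answer's
# zero-filled (n, 4) uint8 array with the code in the first column.)
wide = ascii_codes.astype(np.uint32)

# Reinterpret the 24 bytes as a single 6-character unicode string.
text = wide.view("U6")
print(text)   # ['str_07']
```

With many rows, the same view would be taken over the flattened block, with the dtype width set to the prefix length plus the digit count, as `dl` is in the functions above.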

Timings:

In [8]: N = 100000

In [9]: %timeit ['str_%s'%i for i in range(N)]
100 loops, best of 3: 18.5 ms per loop

In [10]: %timeit create_inc_pattern_numexpr(prefix_str='str_', start=1, stop=N)
100 loops, best of 3: 6.06 ms per loop

Answer 3: (score: 4)

One possible solution is to convert to string with map:

df['New_Column'] = np.arange(len(df['a']))+1
df['New_Column'] = 'str_' + df['New_Column'].map(str)
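
As a self-contained check, the same idea can be run against the question's sample data. This is a sketch under assumptions: the frame is rebuilt by hand from the question (it has no column named 'a', so `len(df)` is used instead), and the helper name `counter` is hypothetical.

```python
import numpy as np
import pandas as pd

# Sample frame rebuilt from the question's input table (assumed construction).
df = pd.DataFrame({"id": [1, 2, 3], "Field": ["A", "B", "D"], "Value": [1, 0, 1]})

# 1-based counter aligned with the frame's index.
counter = pd.Series(np.arange(1, len(df) + 1), index=df.index)

# map(str) converts each integer to text; the prefix is then prepended element-wise.
df["New_Column"] = "str_" + counter.map(str)

print(df)
```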