Vectorized integration of a pandas.DataFrame

Asked: 2015-12-31 15:53:24

Tags: python numpy pandas vectorization numerical-integration

I have a DataFrame of force-displacement data. The displacement array has been set as the DataFrame index, and the columns are the force curves from various different tests.

How do I calculate the work done (i.e. the "area under the curve")?

I looked at numpy.trapz, which seems to do what I need, but I think I can avoid looping over each column like this:

import numpy as np
import pandas as pd 

forces = pd.read_csv(...)
work_done = {}

for col in forces.columns:
    work_done[col] = np.trapz(forces.loc[col], forces.index))

Rather than a dict, I was hoping to create a new DataFrame of the areas under the curves, and thought DataFrame.apply() or something similar might be appropriate, but I don't know where to start looking.

In short:

  1. Can I avoid the loop?
  2. Can I create a DataFrame of work done directly?

Thanks in advance for any help.

2 answers:

Answer 0 (score: 5):

You can vectorize this by passing the whole DataFrame to np.trapz and specifying the axis= argument, e.g.:

import numpy as np
import pandas as pd

# some random input data
gen = np.random.RandomState(0)
x = gen.randn(100, 10)
names = [chr(97 + i) for i in range(10)]
forces = pd.DataFrame(x, columns=names)

# vectorized version
wrk = np.trapz(forces, x=forces.index, axis=0)
work_done = pd.DataFrame(wrk[None, :], columns=forces.columns)

# non-vectorized version for comparison
work_done2 = {}
for col in forces.columns:
    work_done2.update({col: np.trapz(forces.loc[:, col], forces.index)})

These give the following output:

from pprint import pprint

pprint(work_done.T)
#            0
# a -24.331560
# b -10.347663
# c   4.662212
# d -12.536040
# e -10.276861
# f   3.406740
# g  -3.712674
# h  -9.508454
# i  -1.044931
# j  15.165782

pprint(work_done2)
# {'a': -24.331559643023006,
#  'b': -10.347663159421426,
#  'c': 4.6622123535050459,
#  'd': -12.536039649161403,
#  'e': -10.276861220217308,
#  'f': 3.4067399176289994,
#  'g': -3.7126739591045541,
#  'h': -9.5084536839888187,
#  'i': -1.0449311137294459,
#  'j': 15.165781517623724}

Your original example also had a couple of other problems: col is a column name rather than a row index, so it needs to index the second dimension of your DataFrame (i.e. .loc[:, col] rather than .loc[col]). Also, the last line has an extra trailing parenthesis.
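With those two fixes applied, the loop version from the question runs correctly. Below is a minimal sketch with made-up linear force curves (the column names test_a and test_b are illustrative only; the sketch also resolves np.trapz to np.trapezoid, its name since NumPy 2.0, so it runs on either version):

```python
import numpy as np
import pandas as pd

# np.trapz was renamed np.trapezoid in NumPy 2.0; support both
trapezoid = getattr(np, "trapezoid", None) or np.trapz

# Hypothetical force-displacement data: displacement as the index,
# one column per test (names are made up for illustration)
displacement = np.linspace(0.0, 1.0, 11)
forces = pd.DataFrame(
    {"test_a": 2.0 * displacement, "test_b": 3.0 * displacement},
    index=displacement,
)

work_done = {}
for col in forces.columns:
    # Index the column dimension, with no stray closing parenthesis
    work_done[col] = trapezoid(forces.loc[:, col], forces.index)

# For a linear force F = k*x on [0, 1] the work is k/2, and the
# trapezoidal rule is exact for linear integrands
print(work_done["test_a"], work_done["test_b"])
```

Since the trapezoidal rule is exact on straight lines, the results here can be checked by hand: k/2 for each curve.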

Edit:

You can generate the output directly by .applying np.trapz to each column, e.g.:

work_done = forces.apply(np.trapz, axis=0, args=(forces.index,))

However, this isn't really "proper" vectorization: you are still calling np.trapz separately on each column. You can see this by comparing the speed of the .apply version against calling np.trapz directly:

In [1]: %timeit forces.apply(np.trapz, axis=0, args=(forces.index,))
1000 loops, best of 3: 582 µs per loop

In [2]: %timeit np.trapz(forces, x=forces.index, axis=0)
The slowest run took 6.04 times longer than the fastest. This could mean that an intermediate result is being cached
10000 loops, best of 3: 53.4 µs per loop


This isn't a totally fair comparison, since the second version excludes the extra time taken to construct the DataFrame from the output numpy array, but that should still be much smaller than the time taken to perform the actual integration.
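Speed aside, the vectorized call and the per-column loop compute the same integrals, which is easy to confirm directly. Here is a minimal sketch using the same random data as above (np.trapz is resolved to np.trapezoid, its NumPy 2.0 name, so the check runs on either version):

```python
import numpy as np
import pandas as pd

# np.trapz was renamed np.trapezoid in NumPy 2.0; support both
trapezoid = getattr(np, "trapezoid", None) or np.trapz

gen = np.random.RandomState(0)
forces = pd.DataFrame(gen.randn(100, 10),
                      columns=[chr(97 + i) for i in range(10)])

# Vectorized: one call, integrating down each column (axis=0)
vectorized = trapezoid(forces, x=forces.index, axis=0)

# Looped: one call per column
looped = np.array([trapezoid(forces.loc[:, c], forces.index)
                   for c in forces.columns])

print(np.allclose(vectorized, looped))  # True
```

So the choice between the two is purely a question of speed and convenience, not of numerical results.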

Answer 1 (score: 0):

Here is how to get the cumulative integral of a DataFrame column using the trapezoidal rule. Alternatively, the following creates a pandas.Series method for choosing between the trapezoidal, Simpson's, and Romberg's rules (source):

import pandas as pd
from scipy import integrate
import numpy as np

#%% Setup Functions

def integrate_method(self, how='trapz', unit='s'):
    '''Numerically integrate the time series.

    @param how: the method to use (trapz by default)
    @return 

    Available methods:
     * trapz - trapezoidal
     * cumtrapz - cumulative trapezoidal
     * simps - Simpson's rule
 * romb - Romberg's rule

    See http://docs.scipy.org/doc/scipy/reference/integrate.html for the method details.
    or the source code
    https://github.com/scipy/scipy/blob/master/scipy/integrate/quadrature.py
    '''
    available_rules = set(['trapz', 'cumtrapz', 'simps', 'romb'])
    if how in available_rules:
        rule = getattr(integrate, how)
    else:
        print('Unsupported integration rule: %s' % (how))
        print('Expecting one of these sample-based integration rules: %s' % (str(list(available_rules))))
        raise AttributeError

    if how == 'cumtrapz':
        result = rule(self.values)
        result = np.insert(result, 0, 0, axis=0)        
    else: 
        result = rule(self.values)
    return result

pd.Series.integrate = integrate_method

#%% Setup (random) data
gen = np.random.RandomState(0)
x = gen.randn(100, 10)
names = [chr(97 + i) for i in range(10)]
df = pd.DataFrame(x, columns=names)


#%% Cumulative integral
df_cumulative_integral = df.apply(lambda x: x.integrate('cumtrapz'))
df_integral = df.apply(lambda x: x.integrate('trapz'))

df_do_they_match = df_cumulative_integral.tail(1).round(3) == df_integral.round(3)

if df_do_they_match.all().all():
    print("Trapz produces the last row of cumtrapz")
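The final check rests on a simple property: the last entry of a cumulative trapezoidal integral equals the plain trapezoidal integral over the whole range. That property can be verified with NumPy alone; the running-sum construction below mirrors what cumtrapz computes (a minimal sketch, with np.trapz resolved to its NumPy 2.0 name np.trapezoid where available):

```python
import numpy as np

# np.trapz was renamed np.trapezoid in NumPy 2.0; support both
trapezoid = getattr(np, "trapezoid", None) or np.trapz

x = np.linspace(0.0, np.pi, 201)
y = np.sin(x)

# Cumulative trapezoidal integral built by hand:
# running sum of the individual panel areas
panel_areas = 0.5 * (y[1:] + y[:-1]) * np.diff(x)
cumulative = np.concatenate(([0.0], np.cumsum(panel_areas)))

total = trapezoid(y, x)
print(np.isclose(cumulative[-1], total))  # True
# Both approximate the exact integral of sin on [0, pi], which is 2
print(round(total, 3))  # → 2.0
```

This is exactly the relationship the answer's script asserts when it compares the tail of the cumtrapz result against the trapz result for each column.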