I have some data that I process with pandas DataFrames. They contain about 10,000 rows and 6 columns.
The problem is that I ran several trials, and the different datasets have slightly different index values. (These are "force-length" tests on several materials, and of course the measurement points do not line up perfectly.)
Now my idea was to "resample" the data using the index, which holds the length values. But the resample function in pandas seems to work only on datetime data types.
I tried converting the index via to_datetime and that worked, but after resampling I need to get back to the original scale — some kind of from_datetime function.
Is there a way to do this, or am I on completely the wrong track and should rather use something like groupby?
Edit to add:
The data look like the table below. The length is used as the index. I have several of these DataFrames, so it would be really nice to align them all to one "frame rate" and then cut them, e.g. so that I can compare the different datasets.
The idea I have tried so far is this:
df_1_dt = df_1.copy()  # generate a copy for the conversion (plain assignment would only alias df_1)
df_1_dt.index = pd.to_datetime(df_1_dt.index, unit='s')  # convert it, simulating seconds.. good idea?!
df_1_dt_rs = df_1_dt.copy()  # generate a df for the resampling
df_1_dt_rs = df_1_dt_rs.resample(rule='s').mean()  # resample by the generated time; resample() returns a Resampler, so an aggregation like .mean() is needed
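The round trip the snippet above is missing can be sketched as follows. This is a minimal, self-contained example with made-up force-length values (the frame `df_1` and the millisecond bin size are assumptions for illustration); the key step is converting the DatetimeIndex back to a float index by reading it as nanoseconds and dividing:

```python
import pandas as pd

# Hypothetical force-length data; the index holds length values in arbitrary units.
df_1 = pd.DataFrame({'Force1': [4.74, 4.72, 4.70, 4.65]},
                    index=[0.001, 0.0025, 0.004, 0.0061])

df_1_dt = df_1.copy()                                    # work on a copy, not an alias
df_1_dt.index = pd.to_datetime(df_1_dt.index, unit='s')  # pretend the lengths are seconds

# resample() needs an aggregation; 'ms' bins the fake timestamps at millisecond steps
df_1_rs = df_1_dt.resample('ms').mean()

# convert the DatetimeIndex back to the original float scale:
# int64 view gives nanoseconds since epoch, so divide by 1e9 to recover "seconds"
df_1_rs.index = df_1_rs.index.astype('int64') / 1e9
print(df_1_rs)
```

Empty bins come back as NaN rows, which you can drop or interpolate depending on what the comparison between runs needs.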
The data:
+---------------------------------------------------+
¦ Index (Length) ¦ Force1 ¦ Force2 ¦
¦-------------------+---------------+---------------¦
¦ 8.04662074828e-06 ¦ 4.74251270294 ¦ 4.72051584721 ¦
¦ 8.0898882798e-06 ¦ 4.72051584721 ¦ 4.72161570191 ¦
¦ 1.61797765596e-05 ¦ 4.69851899147 ¦ 4.72271555662 ¦
¦ 1.65476570973e-05 ¦ 4.65452528 ¦ 4.72491526604 ¦
¦ 2.41398605024e-05 ¦ 4.67945501539 ¦ 4.72589291467 ¦
¦ 2.42696630876e-05 ¦ 4.70438475079 ¦ 4.7268705633 ¦
¦ 9.60953101751e-05 ¦ 4.72931448619 ¦ 4.72784821192 ¦
¦ 0.00507703541206 ¦ 4.80410369237 ¦ 4.73078115781 ¦
¦ 0.00513927175509 ¦ 4.87889289856 ¦ 4.7337141037 ¦
¦ 0.00868965311878 ¦ 4.9349848032 ¦ 4.74251282215 ¦
¦ 0.00902026197556 ¦ 4.99107670784 ¦ 4.7513115406 ¦
¦ 0.00929150878827 ¦ 5.10326051712 ¦ 4.76890897751 ¦
¦ 0.0291729332784 ¦ 5.14945375919 ¦ 4.78650641441 ¦
¦ 0.0296332588857 ¦ 5.17255038023 ¦ 4.79530513287 ¦
¦ 0.0297080942518 ¦ 5.19564700127 ¦ 4.80410385132 ¦
¦ 0.0362595526707 ¦ 5.2187436223 ¦ 4.80850321054 ¦
¦ 0.0370305483177 ¦ 5.24184024334 ¦ 4.81290256977 ¦
¦ 0.0381506204153 ¦ 5.28803348541 ¦ 4.82170128822 ¦
¦ 0.0444440795306 ¦ 5.30783069134 ¦ 4.83050000668 ¦
¦ 0.0450121369102 ¦ 5.3177292943 ¦ 4.8348993659 ¦
¦ 0.0453465140473 ¦ 5.32762789726 ¦ 4.83929872513 ¦
¦ 0.0515533437013 ¦ 5.33752650023 ¦ 4.85359662771 ¦
¦ 0.05262489708 ¦ 5.34742510319 ¦ 4.8678945303 ¦
¦ 0.0541273847206 ¦ 5.36722230911 ¦ 4.89649033546 ¦
¦ 0.0600755845953 ¦ 5.37822067738 ¦ 4.92508614063 ¦
¦ 0.0607712385295 ¦ 5.38371986151 ¦ 4.93938404322 ¦
¦ 0.0612954159368 ¦ 5.38921904564 ¦ 4.9536819458 ¦
¦ 0.0670288249293 ¦ 5.39471822977 ¦ 4.97457891703 ¦
¦ 0.0683640870058 ¦ 5.4002174139 ¦ 4.99547588825 ¦
¦ 0.0703192637772 ¦ 5.41121578217 ¦ 5.0372698307 ¦
¦ 0.0757871634772 ¦ 5.43981158733 ¦ 5.07906377316 ¦
¦ 0.0766597757545 ¦ 5.45410948992 ¦ 5.09996074438 ¦
¦ 0.077317850103 ¦ 5.4684073925 ¦ 5.12085771561 ¦
¦ 0.0825991083545 ¦ 5.48270529509 ¦ 5.13295596838 ¦
¦ 0.0841354654428 ¦ 5.49700319767 ¦ 5.14505422115 ¦
¦ 0.0865525182528 ¦ 5.52559900284 ¦ 5.1692507267 ¦
+---------------------------------------------------+
Answer 0 (score: 1)
It sounds like what you want to do is round the length values down to a lower precision.
If that is the case, you can use the built-in rounding function:
(dummy data)
>>> df = pd.DataFrame([[1.0000005, 4], [1.232463632, 5], [5.234652, 9], [5.675322, 10]], columns=['length', 'force'])
>>> df
     length  force
0  1.000001      4
1  1.232464      5
2  5.234652      9
3  5.675322     10
>>> df['rounded_length'] = df.length.apply(round, ndigits=0)
>>> df
     length  force  rounded_length
0  1.000001      4             1.0
1  1.232464      5             1.0
2  5.234652      9             5.0
3  5.675322     10             6.0
Then you can use groupby to replicate the resample() workflow:
>>> df.groupby('rounded_length').mean().force
rounded_length
1.0     4.5
5.0     9.0
6.0    10.0
Name: force, dtype: float64
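Applied to the question's force-length data, the same round-then-groupby idea works on the index with an explicit bin width. This is a sketch with made-up numbers; the 0.001 bin width is an assumption, chosen to match the scale of the sample table:

```python
import numpy as np
import pandas as pd

# Hypothetical force-length frame; the index is the measured length.
df = pd.DataFrame({'force': [4.74, 4.72, 4.70, 4.80, 4.88]},
                  index=[0.0012, 0.0014, 0.0051, 0.0053, 0.0092])

bin_width = 0.001  # assumed bin size for this illustration
# snap each length to the nearest bin center, then round away float noise
binned = np.round(np.round(df.index / bin_width) * bin_width, 6)

# groupby on the snapped lengths replicates a resample().mean() workflow
result = df.groupby(binned)['force'].mean()
print(result)
```

Two DataFrames binned with the same `bin_width` end up with comparable indices, which was the original goal of aligning the trials.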
Generally, resampling is really just for dates. If you are using it for anything other than dates, there is probably a more elegant solution available!
Answer 1 (score: 1)
I found out how to do this using reindex and interpolate.
The result: the blue dots are the original data, the red line is the reindexed/interpolated data.
Here is the code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.DataFrame({'X' : [1.1, 2.05, 3.07, 4.2],
                   'Y1': [10.1, 15.2, 35.3, 40.4],
                   'Y2': [55.05, 40.4, 84.17, 31.5]})
print(df)
df.set_index('X', inplace=True)
print(df)
Xresampled = np.linspace(1,4,15)
print(Xresampled)
#Resampling
#df = df.reindex(df.index.union(Xresampled))
#Interpolation technique to use. One of:
#'linear': Ignore the index and treat the values as equally spaced. This is the only method supported on MultiIndexes.
#'time': Works on daily and higher resolution data to interpolate given length of interval.
#'index', 'values': use the actual numerical values of the index.
#'pad': Fill in NaNs using existing values.
#'nearest', 'zero', 'slinear', 'quadratic', 'cubic', 'spline', 'barycentric', 'polynomial': Passed to scipy.interpolate.interp1d. These methods use the numerical values of the index. Both 'polynomial' and 'spline' require that you also specify an order (int), e.g. df.interpolate(method='polynomial', order=5).
#'krogh', 'piecewise_polynomial', 'spline', 'pchip', 'akima': Wrappers around the SciPy interpolation methods of similar names. See Notes.
#'from_derivatives': Refers to scipy.interpolate.BPoly.from_derivatives which replaces 'piecewise_polynomial' interpolation method in scipy 0.18.
df_resampled = df.reindex(df.index.union(Xresampled)).interpolate('values').loc[Xresampled]
print(df_resampled)
# gca stands for 'get current axis'
ax = plt.gca()
df.plot( style='X', y='Y2', color = 'blue', ax=ax, label = 'Original Data' )
df_resampled.plot( style='.-', y='Y2', color = 'red', ax=ax, label = 'Interpolated Data' )
ax.set_ylabel('Y2')
plt.show()
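Since the goal in the question was to compare several runs, the reindex-and-interpolate step above can be wrapped in a small helper and applied to each frame so they share one length grid. This is a sketch with made-up data; `to_grid` is a hypothetical helper name, and the grid bounds are chosen to lie inside both index ranges:

```python
import numpy as np
import pandas as pd

# Two hypothetical runs with slightly different length indices
run_a = pd.DataFrame({'force': [10., 20., 30.]}, index=[1.1, 2.05, 3.07])
run_b = pd.DataFrame({'force': [11., 19., 33.]}, index=[1.0, 2.1, 2.95])

grid = np.linspace(1.1, 2.9, 10)  # common length grid inside both ranges

def to_grid(df, grid):
    # union the grid with the original index, interpolate using the
    # numerical index values, then keep only the grid points
    return df.reindex(df.index.union(grid)).interpolate('index').loc[grid]

a_on_grid = to_grid(run_a, grid)
b_on_grid = to_grid(run_b, grid)

diff = a_on_grid['force'] - b_on_grid['force']  # now directly comparable
print(diff)
```

Keeping the grid strictly inside every run's measured range avoids NaNs from extrapolation at the edges.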
Answer 2 (score: 0)
I ran into a very similar problem and found a solution. The solution is essentially

Integrate -> Interpolate -> Differentiate

First, I will describe the problem to be solved, to make sure we are on the same page. A simple illustration: you have the points (x1, y1) and (x2, y2) and want (x0', y0') and (x1', y1') (you know x0' and x1' and are looking for the y values). A plain point-wise interpolation can lose information between samples; integrating first preserves the accumulated quantity. So the way to do this is to integrate, then interpolate, then differentiate. Suppose you have a DataFrame with columns 'x' and 'y', but you want to resample to a new x, new_x, which is a numpy.ndarray:

df['integral'] = (df['y'] * (df['x'] - df['x'].shift(1))).cumsum()
new_integral = np.interp(new_x, df['x'].values, df['integral'].values, left=0., right=np.nan)
new = pd.DataFrame({'new_x': new_x, 'integral': new_integral})
new['y'] = (new['integral'] - new['integral'].shift(1)) / (new['new_x'] - new['new_x'].shift(1))

I prefer to start new_x at 0. and then drop the first value from the new DataFrame, since it will be NaN. You can also fill the leading and trailing NaN values at high x as you need.

I hope this solves your problem. I have not provided a proof that this method solves the problem defined above, but it is not hard to prove.
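The three steps above can be assembled into a runnable sketch. The data and the new_x grid are made up; with a constant y the method should return y unchanged, which makes the result easy to verify. One detail the snippets leave open is that the first cumsum entry is NaN, so it is set to 0 here (the integral at the first x is zero):

```python
import numpy as np
import pandas as pd

# Hypothetical data: y sampled at uneven x; constant y makes the check trivial
df = pd.DataFrame({'x': [0.0, 0.5, 1.2, 2.0, 3.1],
                   'y': [2.0, 2.0, 2.0, 2.0, 2.0]})

new_x = np.array([0.0, 1.0, 2.0, 3.0])

# 1. integrate: running rectangle-rule integral of y over x
#    (the first entry of the cumsum is NaN; the integral at the first x is 0)
df['integral'] = (df['y'] * (df['x'] - df['x'].shift(1))).cumsum().fillna(0.)

# 2. interpolate the integral onto the new grid
new_integral = np.interp(new_x, df['x'].values, df['integral'].values,
                         left=0., right=np.nan)
new = pd.DataFrame({'new_x': new_x, 'integral': new_integral})

# 3. differentiate: recover y on the new grid
new['y'] = ((new['integral'] - new['integral'].shift(1))
            / (new['new_x'] - new['new_x'].shift(1)))

new = new.iloc[1:]  # drop the first row, which is NaN, as the answer notes
print(new)
```

For this input the recovered y is 2.0 at every new grid point, confirming that the integral-preserving resampling reproduces a constant signal exactly.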