Question

I am running MD simulations. I am interested in cluster growth in the system, so the simulation produces data of the following form:

nmax simtime
6    2.3e-9
7    7.1e-9
8    1.7e-9
11   1.1e-8
13   1.8e-8

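where nmax is the size of the largest cluster present. Obviously, since a cluster of size 8 had appeared by 1.7 ns, clusters of sizes 6 and 7 must have existed by then as well. In addition, I would like the data to include the missing cluster sizes, such as 1, 2, 3 ... 9, 10. So the result would look like this: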

nmax simtime
1    1.7e-9
2    1.7e-9
...
6    1.7e-9
7    1.7e-9
8    1.7e-9
9    1.1e-8
10   1.1e-8
11   1.1e-8
12   1.8e-8
13   1.8e-8

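I am using Python 2.7 and pandas. Previously I used the shift function to create new, shifted simtime columns and then compared these new columns with the original simtime column. If the shifted value is smaller than the original, the original value is replaced by the shifted one.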

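The problem appears as the amount of data grows. The approach I am using needs more and more shifted columns, which produces ugly and probably inefficient code.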

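So: 1) how can I efficiently fix the incorrect simtime values, and 2) how can I include the sizes that are not present in the original data file?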

Answer 1

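I think shifting is the wrong approach here. For the first row, for example, the comparison with the next element would pass as valid, because 2.3e-9 is less than 7.1e-9, even though the row still needs correcting because of the later 1.7e-9 entry (assuming I understood the question correctly).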

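Instead, you can iterate over the rows backwards, always comparing two valid (meaning non-NaN) simtime values. Afterwards you can simply reindex and fill the NaNs by taking the next non-NaN value in the dataframe.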

Code

import pandas as pd
import numpy as np
import math

# prepare test data
df = pd.DataFrame({
        'nmax': [6, 7, 8, 11, 13],
        'simtime': [2.3e-9, 7.1e-9, 1.7e-9, 1.1e-8, 1.8e-8]
    })
df['nmax'] = df.nmax.astype(int)
df = df.set_index('nmax')
print(df)

# set illegal simtime values (bigger than next non-nan value) to nan
for i_curr, i_next in reversed(list(zip(df.index, df.index[1:]))):
    simtime = next(s for s in df.loc[i_next:].simtime if not math.isnan(s))
    if df.at[i_curr, 'simtime'] > simtime:
        # DataFrame.set_value was removed in pandas 1.0; .at sets a single cell
        df.at[i_curr, 'simtime'] = np.nan

# fill index with missing indices and fill nans by taking the next value
df = df.reindex(range(df.index[-1] + 1)).bfill()

print(df)

Result

           simtime
nmax              
6     2.300000e-09
7     7.100000e-09
8     1.700000e-09
11    1.100000e-08
13    1.800000e-08
           simtime
nmax              
0     1.700000e-09
1     1.700000e-09
2     1.700000e-09
3     1.700000e-09
4     1.700000e-09
5     1.700000e-09
6     1.700000e-09
7     1.700000e-09
8     1.700000e-09
9     1.100000e-08
10    1.100000e-08
11    1.100000e-08
12    1.800000e-08
13    1.800000e-08
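
As a possible variation on the loop above (not part of the original answer), the backward comparison can be sketched as a running minimum taken from the largest cluster size downwards. This assumes that every size should receive the earliest time at which a cluster of at least that size was observed, which is how I read the question. The names `corrected` and `result` are only illustrative, and the range here starts at 1 to match the output requested in the question rather than at 0.

```python
import pandas as pd

# same test data as above
df = pd.DataFrame({
    'nmax': [6, 7, 8, 11, 13],
    'simtime': [2.3e-9, 7.1e-9, 1.7e-9, 1.1e-8, 1.8e-8],
}).set_index('nmax')

# Reverse the rows, take the cumulative minimum, and reverse back:
# each size ends up with the smallest simtime found at that size or larger.
corrected = df['simtime'].iloc[::-1].cummin().iloc[::-1]

# Add the missing sizes 1..max and fill each gap from the next larger size.
all_sizes = range(1, int(corrected.index.max()) + 1)
result = corrected.reindex(all_sizes).bfill().to_frame()

print(result)
```

This avoids the Python-level loop, which may matter for larger data files, but it relies on the same assumption about what counts as an incorrect simtime value.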

Adding and correcting simulation data using pandas?
