Question

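I have a dataframe with about 200 features and 3000 rows. The data samples were recorded at different times, basically once a month, as shown in "col101" in the example below: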

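Some of these features are cumulative, so their values increase every month. For example, col2 and col100 are cumulative features in my dataframe. For each cumulative feature, I therefore want to add a column holding the difference from the previous month. So the dataframe I want should look like this: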

   0    col1 (id)    col2    col3    …    col100    col101 (date)    …    col2000 (target value)
   1    001          653     675     …    343.3     01-02-2017       …    1
   2    001          673     432     …    387.3     01-03-2017       …    0
   3    001          679     528     …    401.2     01-04-2017       …    1
   4    001          685     223     …    503.4     01-05-2017       …    1
   5    002          343     428     …    432.5     01-02-2017       …    0
   6    002          479     421     …    455.3     01-03-2017       …    0
   7    …            …       …       …    …         …                …    …

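Now there are two questions here: 1) how can I automatically identify the cumulative features among the 200 features? and 2) how can I add that extra column for each cumulative feature (for example col22c and col100c)? Does anyone know how I should go about this?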

Answer 1

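For computing the differences within a column, you can use pandas' built-in diff() function. diff() computes the difference between each element and the one before it. Be aware, though, that the first element has no previous element, so the first element of the diff() result will be NaN. We therefore use the built-in dropna() function to remove the NaN values.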
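As a rough sketch of how diff() could be applied to the data in the question: the toy frame below copies a few values from the question's table, and the new column name col2c is only an illustrative choice. Grouping by the id column (col1) is my assumption, not something stated above; it keeps the first row of one id from being subtracted from the last row of another. The sketch also assumes the rows are already sorted by date within each id.

```python
import pandas as pd

# Toy data modelled on the question's table (only three columns shown).
df = pd.DataFrame({
    "col1": ["001", "001", "001", "001", "002", "002"],
    "col2": [653, 673, 679, 685, 343, 479],
    "col100": [343.3, 387.3, 401.2, 503.4, 432.5, 455.3],
})

# Plain diff(): each element minus the previous one; the first result is NaN.
plain_diff = df["col2"].diff()
print(plain_diff)

# Per-id diff (assumption: one time series per id, rows sorted by date).
# "col2c" is a hypothetical name for the new month-over-month column.
df["col2c"] = df.groupby("col1")["col2"].diff()
print(df)
```

With the plain diff() the first row of id 002 is compared with the last row of id 001, which is probably not what you want; the grouped version leaves a NaN there instead.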

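As for detecting the cumulative columns, I don't think there is any way to do that reliably. You can find all the columns that are always increasing (monotonic), but that does not necessarily mean they are cumulative.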
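If a list of candidates is still useful, something like the following could narrow down the 200 columns. This is only a heuristic sketch: monotonic_candidates is a hypothetical helper name, the per-id grouping is my assumption, and pandas' is_monotonic_increasing is non-strict (it accepts repeated values). Whatever it returns still needs a manual check, since a monotonic column is not necessarily a cumulative one.

```python
import pandas as pd

def monotonic_candidates(df, id_col):
    """Return numeric columns that never decrease within any id group."""
    candidates = []
    for col in df.select_dtypes(include="number").columns:
        if col == id_col:
            continue
        # One boolean per id: is this column non-decreasing for that id?
        per_group = df.groupby(id_col)[col].apply(
            lambda s: s.is_monotonic_increasing
        )
        if per_group.all():
            candidates.append(col)
    return candidates

# Toy data modelled on the question's table.
df = pd.DataFrame({
    "col1": ["001", "001", "001", "001", "002", "002"],
    "col2": [653, 673, 679, 685, 343, 479],
    "col3": [675, 432, 528, 223, 428, 421],
    "col100": [343.3, 387.3, 401.2, 503.4, 432.5, 455.3],
})

candidates = monotonic_candidates(df, id_col="col1")
print(candidates)
```

On this toy data col3 is rejected because it drops from 675 to 432 within id 001.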

In any case, to detect monotonic columns, you can first take their diff().dropna() and then check whether all of the resulting values are positive:

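df = some_data_frame  # placeholder for your own DataFrame
# Differences between consecutive values; dropna() removes the leading NaN
col_diff = df['some_column'].diff().dropna()
# True only if every difference is positive (strictly increasing column)
is_monotonic = (col_diff > 0).all()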

Note that if you forget the dropna(), the check will always come out False: the first element of the diff is NaN, and the comparison NaN > 0 evaluates to False.
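The effect of leaving out dropna() is easy to confirm on a small series. The values below are the col2 readings for id 001 from the question's table, used here purely as an illustration.

```python
import pandas as pd

s = pd.Series([653, 673, 679, 685])

with_nan = s.diff()               # first element is NaN
without_nan = s.diff().dropna()   # leading NaN removed

# NaN > 0 is False, so the first comparison spoils the whole check.
check_with_nan = bool((with_nan > 0).all())
check_without_nan = bool((without_nan > 0).all())

print(check_with_nan)      # False
print(check_without_nan)   # True
```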
