I have a CSV file that I import as a dask DataFrame with the following code:
import dask.dataframe as dd
df = dd.read_csv("name and path of the file.csv")
df.head(10)
Output:
+-----+------+-----+
|col1 | col2 | col3|
+-----+------+-----+
| A | 2 | 4 |
+-----+------+-----+
| A | 4 | 5 |
+-----+------+-----+
| A | 7 | 7 |
+-----+------+-----+
| A | 3 | 8 |
+-----+------+-----+
| A | 7 | 3 |
+-----+------+-----+
| B | 8 | 9 |
+-----+------+-----+
| B | 10 | 10 |
+-----+------+-----+
| B | 8 | 9 |
+-----+------+-----+
| B | 20 | 15 |
+-----+------+-----+
I want to create another column, col4, that holds col2[n+3]/col2-1, computed separately for each group in col1.
The output should be:
+-----+------+-----+-----+
|col1 | col2 | col3| col4|
+-----+------+-----+-----+
| A | 2 | 4 | 0.5| #(3/2-1)
+-----+------+-----+-----+
| A | 4 | 5 | 0.75| #(7/4-1)
+-----+------+-----+-----+
| A | 7 | 7 | NA |
+-----+------+-----+-----+
| A | 3 | 8 | NA |
+-----+------+-----+-----+
| A | 7 | 3 | NA |
+-----+------+-----+-----+
| B | 8 | 9 | 1.5 |
+-----+------+-----+-----+
| B | 10 | 10 | NA |
+-----+------+-----+-----+
| B | 8 | 9 | NA |
+-----+------+-----+-----+
| B | 20 | 15 | NA |
+-----+------+-----+-----+
In pandas this can be done with the following code:
df['col4'] = df.groupby('col1')['col2'].transform(lambda x: x.shift(-3)) / df['col2'] - 1
but it does not work as-is in dask. Any help would be greatly appreciated.
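For reference, the pandas computation can be checked against the expected table with a small self-contained sketch. The sample frame below is rebuilt by hand from the tables above (an assumption, since the real CSV is not shown), and it calls `groupby(...).shift(-3)` directly instead of going through `transform` with a lambda. This is plain pandas, not dask.

```python
import pandas as pd

# Sample data copied by hand from the tables in the question
# (assumption: the real CSV has the same columns and row order).
df = pd.DataFrame({
    "col1": ["A"] * 5 + ["B"] * 4,
    "col2": [2, 4, 7, 3, 7, 8, 10, 8, 20],
    "col3": [4, 5, 7, 8, 3, 9, 10, 9, 15],
})

# Within each col1 group, fetch the col2 value three rows ahead.
# Rows with no such value (the last three of each group) become NaN.
ahead = df.groupby("col1")["col2"].shift(-3)

# col4 = col2[n+3] / col2[n] - 1, evaluated per group.
df["col4"] = ahead / df["col2"] - 1

print(df)
```

The first two rows of group A and the first row of group B get a value; every other row is NaN because it has no row three positions ahead in the same group.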
Answer 0 (score: 0)
In this PR, https://github.com/dask/dask/pull/1769, the diff method has now been added to DataFrame and Series, as in pandas.
Also, I would just suggest using diff, where you only need to supply the index offset.
I believe there is already an issue for implementing shift() in dask. I have given the link above, and I hope it answers your question.
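To make the diff suggestion concrete, here is a minimal pandas sketch of how diff relates to shift. Since `diff(-3)` yields x[n] - x[n+3], the requested ratio can be written as -diff(-3)/x[n]. The sample data is rebuilt from the question's table, and this runs on pandas only; whether a dask groupby exposes `diff` in the same way is an assumption that the answer does not confirm.

```python
import pandas as pd

# Sample data rebuilt by hand from the question's table (assumption).
df = pd.DataFrame({
    "col1": ["A"] * 5 + ["B"] * 4,
    "col2": [2, 4, 7, 3, 7, 8, 10, 8, 20],
})

grouped = df.groupby("col1")["col2"]

# diff(-3) is x[n] - x[n+3] inside each group, so the value three
# rows ahead can be recovered as x[n] - diff(-3).
via_diff = df["col2"] - grouped.diff(-3)
via_shift = grouped.shift(-3)

# col4 = x[n+3] / x[n] - 1 = -(x[n] - x[n+3]) / x[n]
col4 = -grouped.diff(-3) / df["col2"]

print(pd.DataFrame({"via_diff": via_diff, "via_shift": via_shift, "col4": col4}))
```

The identity only holds group by group. On a partitioned dask DataFrame, a per-partition approach such as `map_partitions` would give the same numbers only if every col1 group sits entirely inside one partition with its rows in the original order.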