I have a DataFrame like the one below; it is only an example.
date y w diff
2010-1-1 3 1 3
2010-1-2 4 1 4
2010-1-3 5 1 2
2010-1-4 6 2 5
2010-1-5 7 2 6
2010-1-6 8 2 5
2010-1-7 9 3 2
2010-1-8 10 4 4
2010-1-9 11 5 5
2010-1-10 12 6 6
2010-1-11 13 5 6
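To reproduce the example, the table above can be loaded straight from its whitespace-separated text. This is only a sketch: the variable names `raw` and `df` are assumptions, and the `date` column is left as plain strings here.

```python
import io

import pandas as pd

# The sample table, copied verbatim from the question.
raw = """date y w diff
2010-1-1 3 1 3
2010-1-2 4 1 4
2010-1-3 5 1 2
2010-1-4 6 2 5
2010-1-5 7 2 6
2010-1-6 8 2 5
2010-1-7 9 3 2
2010-1-8 10 4 4
2010-1-9 11 5 5
2010-1-10 12 6 6
2010-1-11 13 5 6"""

# Any run of whitespace separates the columns.
df = pd.read_csv(io.StringIO(raw), sep=r"\s+")
print(df)
```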
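Now, say i is the index of the DataFrame. I want to add three new columns, named p1, p2 and p3, whose values come from the first two dates of each period. I use five rows as one period. The first two rows of each period are of course NaN in p1, p2 and p3. For rows 3-5, p1 and p2 are 3 and 4 (the y values of the first two rows), and p3 is the diff value of the last of those two rows, so p3 is 4 for rows 3-5. Likewise, for rows 8-10 the values of p1, p2, p3 are 8, 9, 2. The new DataFrame looks like this: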
date y w diff p1 p2 p3
2010-1-1 3 1 3 Nan Nan Nan
2010-1-2 4 1 4 Nan Nan Nan
2010-1-3 5 1 2 3 4 4
2010-1-4 6 2 5 3 4 4
2010-1-5 7 2 6 3 4 4
2010-1-6 8 2 5 Nan Nan Nan
2010-1-7 9 3 2 Nan Nan Nan
2010-1-8 10 4 4 8 9 2
2010-1-9 11 5 5 8 9 2
2010-1-10 12 6 6 8 9 2
2010-1-11 13 5 6 Nan Nan Nan
If anything about my question is unclear, please comment. Thanks!
Answer 0 (score: 1)
You can use groupby with a custom function, grouping by the array g, which is created by floor division of arange by 5:
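import numpy as np
import pandas as pd

df['date'] = pd.to_datetime(df['date'])
# label every block of five consecutive rows as one period: 0,0,0,0,0,1,1,...
g = np.arange(len(df.index)) // 5

def f(x):
    # move each row's values two rows down within the period
    x = x.shift(2)
    # writable copy, so the in-place edits below never touch the original data
    a = x.to_numpy(copy=True)
    if a.shape[0] > 3:
        # row 3 now holds (y, w, diff) of the period's second row
        a[3, 1] = a[3, 0]   # p2 = y of the second row
        a[3, 0] = a[2, 0]   # p1 = y of the first row
        # rows 2 onward all take the values assembled in row 3
        a[2:] = a[3]
    return pd.DataFrame(a, index=x.index, columns=['p1', 'p2', 'p3'])

# list selection and group_keys=False keep this working on recent pandas versions
df1 = df.groupby(g, group_keys=False)[['y', 'w', 'diff']].apply(f)
print (df1)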
p1 p2 p3
0 NaN NaN NaN
1 NaN NaN NaN
2 3.0 4.0 4.0
3 3.0 4.0 4.0
4 3.0 4.0 4.0
5 NaN NaN NaN
6 NaN NaN NaN
7 8.0 9.0 2.0
8 8.0 9.0 2.0
9 8.0 9.0 2.0
10 NaN NaN NaN
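The custom function applies shift within each group and then sets the values in the NumPy array as required. Finally, join the result to the original DataFrame: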
df2 = df.join(df1)
print (df2)
date y w diff p1 p2 p3
0 2010-01-01 3 1 3 NaN NaN NaN
1 2010-01-02 4 1 4 NaN NaN NaN
2 2010-01-03 5 1 2 3.0 4.0 4.0
3 2010-01-04 6 2 5 3.0 4.0 4.0
4 2010-01-05 7 2 6 3.0 4.0 4.0
5 2010-01-06 8 2 5 NaN NaN NaN
6 2010-01-07 9 3 2 NaN NaN NaN
7 2010-01-08 10 4 4 8.0 9.0 2.0
8 2010-01-09 11 5 5 8.0 9.0 2.0
9 2010-01-10 12 6 6 8.0 9.0 2.0
10 2010-01-11 13 5 6 NaN NaN NaN
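The same result can also be reached without groupby, by computing each row's position inside its period directly. The following is a minimal sketch of that alternative, not the method used in the answer above; the function name `add_period_columns` and its `period` parameter are hypothetical, and it assumes the rows are already in date order with a default positional layout.

```python
import numpy as np
import pandas as pd


def add_period_columns(df, period=5):
    """Add p1, p2, p3 taken from the first two rows of each period."""
    n = len(df)
    idx = np.arange(n)
    pos = idx % period                    # position of each row inside its period
    first = idx - pos                     # positional index of the period's first row
    second = np.minimum(first + 1, n - 1) # clipped so the lookup never goes out of bounds
    fill = pos >= 2                       # only the third row onward receives values

    y = df["y"].to_numpy(dtype=float)
    d = df["diff"].to_numpy(dtype=float)

    out = df.copy()
    out["p1"] = np.where(fill, y[first], np.nan)   # y of the period's first row
    out["p2"] = np.where(fill, y[second], np.nan)  # y of the period's second row
    out["p3"] = np.where(fill, d[second], np.nan)  # diff of the period's second row
    return out


# Sample data matching the question.
df = pd.DataFrame({
    "date": pd.date_range("2010-01-01", periods=11, freq="D"),
    "y": range(3, 14),
    "w": [1, 1, 1, 2, 2, 2, 3, 4, 5, 6, 5],
    "diff": [3, 4, 2, 5, 6, 5, 2, 4, 5, 6, 6],
})

result = add_period_columns(df)
print(result)
```

Because nothing is applied group by group, this avoids the per-group Python call, which may matter on long frames. The trade-off is that it relies on row position rather than on any grouping key.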
library(ICSNP)
library(ggbiplot)
data(iris)
# Mahalanobis Distance calculation Function from https://stackoverflow.com/a/34708113/5731401
D.sq <- function (g1, g2) {
  dbar <- as.vector(colMeans(g1) - colMeans(g2))
  S1 <- cov(g1)
  S2 <- cov(g2)
  n1 <- nrow(g1)
  n2 <- nrow(g2)
  V <- as.matrix((1/(n1 + n2 - 2)) * (((n1 - 1) * S1) + ((n2 - 1) * S2)))
  D.sq <- t(dbar) %*% solve(V) %*% dbar
  res <- list()
  res$D.sq <- D.sq
  res$V <- V
  res
}
iris.pca <- prcomp(iris[,-5], center = TRUE, scale. = TRUE)
str(iris)
# uncomment the next line for illustrative plot
# print(ggbiplot(iris.pca, obs.scale = 1, var.scale = 1, groups = iris$Species, ellipse = TRUE, circle = TRUE))
df.iris.x <- as.data.frame(iris.pca$x)
df.iris.x$Species <- iris$Species
split.data = split(df.iris.x[,-5],df.iris.x$Species)
S1 = split.data[['setosa']]
S2 = split.data[['versicolor']]
S3 = split.data[['virginica']]
# calculate mahalanobis distances for the first two principal components between the groups/species
d1 <- D.sq(S1[,1:2],S2[,1:2])
d2 <- D.sq(S1[,1:2],S3[,1:2])
d3 <- D.sq(S2[,1:2],S3[,1:2])
# Hotelling's two-sample T^2 test on the first two principal components
HotellingsT2(S1[,1:2], S2[,1:2]) #btw setosa and versicolor
HotellingsT2(S1[,1:2], S3[,1:2]) #btw setosa and virginica
HotellingsT2(S2[,1:2], S3[,1:2]) #btw versicolor and virginica
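For readers following along in Python, the pooled-covariance distance that D.sq computes can be written out as below. This is a sketch only: the function name `pooled_mahalanobis_sq` and the toy arrays are hypothetical and not part of the R code above, and it assumes each group has at least two columns.

```python
import numpy as np


def pooled_mahalanobis_sq(g1, g2):
    """Squared Mahalanobis distance between two group means, pooled covariance."""
    g1 = np.asarray(g1, dtype=float)
    g2 = np.asarray(g2, dtype=float)
    n1, n2 = len(g1), len(g2)

    # Difference of the column means, as in dbar.
    dbar = g1.mean(axis=0) - g2.mean(axis=0)

    # Sample covariances (rows are observations, columns are variables).
    s1 = np.cov(g1, rowvar=False)
    s2 = np.cov(g2, rowvar=False)

    # Pooled covariance, weighted by degrees of freedom, as in V.
    v = ((n1 - 1) * s1 + (n2 - 1) * s2) / (n1 + n2 - 2)

    # dbar' V^{-1} dbar, solved without forming the inverse explicitly.
    d_sq = float(dbar @ np.linalg.solve(v, dbar))
    return d_sq, v


# Toy data: two identical squares, one shifted by 3 along the first axis.
a = np.array([[0, 0], [2, 0], [0, 2], [2, 2]])
b = a + np.array([3.0, 0.0])

d_sq, v = pooled_mahalanobis_sq(a, b)
print(d_sq)
```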