python2.7 dataframe: add new columns from existing column values

Time: 2017-06-20 09:51:22

Tags: python-2.7 dataframe

I have a DataFrame like the one below; it is just an example.

date       y     w   diff
 2010-1-1   3     1    3
 2010-1-2   4     1    4
 2010-1-3   5     1    2
 2010-1-4   6     2    5
 2010-1-5   7     2    6
 2010-1-6   8     2    5
 2010-1-7   9     3    2
 2010-1-8   10    4    4
 2010-1-9   11    5    5
 2010-1-10  12    6    6
 2010-1-11  13    5    6
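Now, say i is the index of the DataFrame. I want to add three new columns, named p1, p2 and p3, whose values come from the first two dates of each period. I treat every five rows as one period. The first two rows of a period have NaN for p1, p2 and p3. For rows 3-5, p1 and p2 are 3 and 4 (the y values of the first two rows), and p3 is the last diff value of those two rows, so p3 is 4 for rows 3-5. Likewise, for rows 8-10 the values of p1, p2 and p3 are 8, 9 and 2. The new DataFrame looks like this: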

 date       y     w   diff  p1   p2   p3
 2010-1-1   3     1    3    NaN  NaN  NaN
 2010-1-2   4     1    4    NaN  NaN  NaN
 2010-1-3   5     1    2    3    4    4
 2010-1-4   6     2    5    3    4    4
 2010-1-5   7     2    6    3    4    4
 2010-1-6   8     2    5    NaN  NaN  NaN
 2010-1-7   9     3    2    NaN  NaN  NaN
 2010-1-8   10    4    4    8    9    2
 2010-1-9   11    5    5    8    9    2
 2010-1-10  12    6    6    8    9    2
 2010-1-11  13    5    6    NaN  NaN  NaN

If anything about my question is unclear, please comment. Thanks!
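
For anyone who wants to reproduce this, the sample input above could be built roughly as follows. This is only a sketch: the name df and the use of pd.date_range for the consecutive dates are assumptions, not part of the original post.

```python
import pandas as pd

# Sample values copied from the table in the question.
# Assumption: the dates are the 11 consecutive days starting 2010-01-01.
df = pd.DataFrame({
    'date': pd.date_range('2010-01-01', periods=11, freq='D'),
    'y': list(range(3, 14)),
    'w': [1, 1, 1, 2, 2, 2, 3, 4, 5, 6, 5],
    'diff': [3, 4, 2, 5, 6, 5, 2, 4, 5, 6, 6],
}, columns=['date', 'y', 'w', 'diff'])

print(df)
```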

1 answer:

Answer 0 (score: 1)

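You can use groupby with an array g, created by np.arange with floor division, together with a custom function that applies shift and then sets the values in the NumPy array as required:

import numpy as np
import pandas as pd

df['date'] = pd.to_datetime(df['date'])
# block label for every row: rows 0-4 -> 0, rows 5-9 -> 1, ...
g = np.arange(len(df.index)) // 5

def f(x):
    # move every row of the block down by two; the first two rows become NaN
    x = x.shift(2)
    # copy so the assignments below never write into the DataFrame's own data
    a = x.values.copy()
    # note: a block of exactly 4 rows would raise IndexError at a[4]
    if a.shape[0] > 3:
        a[3,1] = a[3, 0]   # p2 <- y of the block's 2nd row
        a[3,0] = a[2, 0]   # p1 <- y of the block's 1st row
        a[2] = a[3]        # a[3, 2] already holds diff of the 2nd row (p3)
        a[4] = a[3]
    return pd.DataFrame(a, index=x.index, columns=['p1','p2','p3'])

# list selection and group_keys=False keep this working on newer pandas versions
df1 = df.groupby(g, group_keys=False)[['y','w','diff']].apply(f)
print (df1)
     p1   p2   p3
0   NaN  NaN  NaN
1   NaN  NaN  NaN
2   3.0  4.0  4.0
3   3.0  4.0  4.0
4   3.0  4.0  4.0
5   NaN  NaN  NaN
6   NaN  NaN  NaN
7   8.0  9.0  2.0
8   8.0  9.0  2.0
9   8.0  9.0  2.0
10  NaN  NaN  NaN

Finally, join the result to the original DataFrame: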

df2 = df.join(df1)
print (df2)
         date   y  w  diff   p1   p2   p3
0  2010-01-01   3  1     3  NaN  NaN  NaN
1  2010-01-02   4  1     4  NaN  NaN  NaN
2  2010-01-03   5  1     2  3.0  4.0  4.0
3  2010-01-04   6  2     5  3.0  4.0  4.0
4  2010-01-05   7  2     6  3.0  4.0  4.0
5  2010-01-06   8  2     5  NaN  NaN  NaN
6  2010-01-07   9  3     2  NaN  NaN  NaN
7  2010-01-08  10  4     4  8.0  9.0  2.0
8  2010-01-09  11  5     5  8.0  9.0  2.0
9  2010-01-10  12  6     6  8.0  9.0  2.0
10 2010-01-11  13  5     6  NaN  NaN  NaN
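
As a possible alternative to the hard-coded row swaps in f, the same columns can be computed directly from row positions with floor division and modulo. This is a sketch, not the answerer's method: the function name add_period_columns and its period parameter are hypothetical, and it assumes that rows 3 and later of an incomplete trailing block should also be filled.

```python
import numpy as np
import pandas as pd


def add_period_columns(df, period=5):
    """Hypothetical helper: add p1, p2, p3 per block of `period` rows."""
    n = len(df)
    pos = np.arange(n)
    start = (pos // period) * period       # position of the first row of each row's block
    offset = pos % period                  # 0-based position of a row inside its block
    second = np.minimum(start + 1, n - 1)  # second row of the block, clipped to stay in bounds
    fill = offset >= 2                     # only rows 3 and later of a block get values

    y = df['y'].values.astype(float)
    d = df['diff'].values.astype(float)
    return df.assign(
        p1=np.where(fill, y[start], np.nan),   # y of the block's 1st row
        p2=np.where(fill, y[second], np.nan),  # y of the block's 2nd row
        p3=np.where(fill, d[second], np.nan),  # diff of the block's 2nd row
    )


# Sample values copied from the question (date column omitted for brevity).
df = pd.DataFrame({
    'y': list(range(3, 14)),
    'w': [1, 1, 1, 2, 2, 2, 3, 4, 5, 6, 5],
    'diff': [3, 4, 2, 5, 6, 5, 2, 4, 5, 6, 6],
}, columns=['y', 'w', 'diff'])

out = add_period_columns(df)
print(out)
```

For the sample data, the p1, p2 and p3 columns match the df2 output above. Unlike f, this version does not index fixed rows 2 to 4, so a short final block cannot raise an IndexError.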