I have a dataframe, and I want to divide each row in column A by the sum of column A and store the result in a new column of the dataframe.
Example:
Col A        New Col
2            .22
3            .33
4            .44
Total = 9    1.00
I tried summing Col A and then dividing by 'Total', but because Total is a row rather than a column, it doesn't work. I just get NaN for every row in the new column.
df['New Col']= (df['ColA']/df.loc['Total'])
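I know you can also fold the sum calculation into a single line of code instead of creating a total row, but I'm not sure how to do that and couldn't find anything online.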
df['New Col']= (df['ColA']/df.sum())
Thoughts?
Answer 0: (score: 4)
df['new'] = df['ColA'] / df['ColA'].sum()
That should work.
Answer 1: (score: 2)
Another way is to use transform.
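A minimal sketch of what a transform-based version might look like, assuming a small frame with a single column named Col A. The lambda and the intermediate name shares are illustrative choices made here, not taken from the answer.

```python
import pandas as pd

# Assumed sample data matching the example in the question.
df = pd.DataFrame({"Col A": [2, 3, 4]})

# DataFrame.transform applies the function to each selected column and
# must return a result with the same index, so every value is divided
# by the total of its own column.
shares = df[["Col A"]].transform(lambda col: col / col.sum())
df["New Col"] = shares["Col A"]

print(df)
```

Selecting the column with double brackets keeps it as a DataFrame, so the function receives the whole column at once and col.sum() is the column total.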
Answer 2: (score: 1)
You're very close. You want to call sum() on the Col A series:
df['New Col'] = df['Col A']/df['Col A'].sum()
The result is a dataframe that looks like this:
>>> df
Col A New Col
0 2 0.222222
1 3 0.333333
2 4 0.444444
Now, if you call df.sum(), you get a Series containing the total of each column:
>>> df.sum()
Col A 9.0
New Col 1.0
dtype: float64
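This is probably also why dividing by df.sum() in the question produced NaN: pandas aligns two Series on their index labels, and the row labels 0, 1, 2 never match the column label in the totals. A small sketch, assuming the same three-row frame (the names totals, attempt and Fixed are illustrative):

```python
import pandas as pd

# Assumed sample data matching the example in the question.
df = pd.DataFrame({"Col A": [2, 3, 4]})

# df.sum() is a Series indexed by column name, not by row label.
totals = df.sum()

# Series / Series aligns on labels; 0, 1, 2 have no partner in
# totals, so every entry of the result is NaN.
attempt = df["Col A"] / totals
df["New Col"] = attempt

# Picking the scalar total out of the Series avoids the alignment.
df["Fixed"] = df["Col A"] / totals["Col A"]

print(df)
```

Dividing by a scalar broadcasts it to every row, which is what df['Col A'].sum() does directly.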