我有一个大型数据框(> 400万行),其中包含存储字符串的yname2
,yname3
,yname1 | yname2 | yname3
aaaaaa | bbbaaa | bbaaaa
aaabbb | cccccc | aaaaaa
aaaaaa | aaabbb | dddddd
cccccc | dddddd | eeeeee
列:
yname1 | yname2 | yname3 | rcount1 | rcount2 | rcount3
aaaaaa | bbbaaa | bbaaaa | 3 | 1 | 1
aaabbb | cccccc | aaaaaa | 2 | 2 | 3
aaaaaa | aaabbb | dddddd | 3 | 2 | 2
cccccc | dddddd | eeeeee | 2 | 2 | 1
现在我想计算所有列中每个字符串的总出现次数。这些应作为附加列添加:
data3$rcount1 <- sapply(data3$yname1, function(x) sum(data2$yname1==x)+sum(data2$yname2==x)+sum(data2$yname3==x))
data3$rcount2 <- sapply(data3$yname2, function(x) sum(data2$yname1==x)+sum(data2$yname2==x)+sum(data2$yname3==x))
data3$rcount3 <- sapply(data3$yname3, function(x) sum(data2$yname1==x)+sum(data2$yname2==x)+sum(data2$yname3==x))
我已经编写了以下代码,完成了这项工作:
@Select(value= "{ CALL getTotalCityStateId()}")
@Options(statementType = StatementType.CALLABLE)
@ResultType(State.class)
@Results(
{
@Result(property="id", column="state_id"),
@Result(property="name", column="state_name"),
@Result(property="code", column="state_code"),
})
List<State> callGetStatesAnnotations();
然而,这非常缓慢,需要数天才能计算出来。我有什么想法可以加快速度吗?
答案 0 :(得分:6)
data.table
方法怎么样:
library(data.table)
setDT(d)
lookup <- melt(d, measure.vars = paste0("yname", 1:3))[, .N, by = value]
# value N
#1: aaaaaa 3
#2: aaabbb 2
#3: cccccc 2
#4: bbbaaa 1
#5: dddddd 2
#6: bbaaaa 1
#7: eeeeee 1
d[, paste0("rcount", 1:3) :=
lapply(d, function(x) lookup[x, , on = .(value)][, N])]
# yname1 yname2 yname3 rcount1 rcount2 rcount3
#1: aaaaaa bbbaaa bbaaaa 3 1 1
#2: aaabbb cccccc aaaaaa 2 2 3
#3: aaaaaa aaabbb dddddd 3 2 2
#4: cccccc dddddd eeeeee 2 2 1
Microbenchmark输出复制来自bgoldst的例子,但有400,000行。
Unit: seconds
expr min lq mean median uq max neval
bgoldst(df) 21.445961 21.628228 21.876051 21.810496 22.091096 22.371697 3
alistaire(df) 20.685357 20.961761 21.255457 21.238164 21.540507 21.842850 3
jota(dt) 2.629337 2.692613 2.719207 2.755889 2.764141 2.772394 3
mhairi(df) 40.780441 41.048345 41.669798 41.316249 42.114476 42.912702 3
coffein(df) 35.669630 35.678719 36.453257 35.687808 36.845071 38.002334 3
espresso(df) 20.823840 20.976175 21.317218 21.128509 21.563907 21.999306 3
答案 1 :(得分:6)
在基础R中,您可以构建一个包含data.frame的未列出值的表,并按值对其进行索引。确保你索引的是一个字符串,而不是一个因子(因此是as.character
),或者它将被数字而不是名称索引。
data.frame(df,
lapply(df, function(x){data.frame(table(unlist(df))[as.character(x)])['Freq']})
)
# yname1 yname2 yname3 Freq Freq.1 Freq.2
# 1 aaaaaa bbbaaa bbaaaa 3 1 1
# 2 aaabbb cccccc aaaaaa 2 2 3
# 3 aaaaaa aaabbb dddddd 3 2 2
# 4 cccccc dddddd eeeeee 2 2 1
如果data.frame足够大而且速度很慢,那么您可以在lapply
之外构建表,因此它只运行一次:
df_table <- table(unlist(df))
data.frame(df, lapply(df, function(x){data.frame(df_table[as.character(x)])['Freq']}))
你也可以把它放在dplyr
中,这使它更具可读性:
# look up times repeated
df %>% mutate_each(funs(table(unlist(df))[as.character(.)])) %>% # or mutate_each(funs(df_table[as.character(.)]))
# fix column names
select(rcount = starts_with('yname')) %>%
# add original df back in
bind_cols(df, .)
# Source: local data frame [4 x 6]
#
# yname1 yname2 yname3 rcount1 rcount2 rcount3
# (fctr) (fctr) (fctr) (tabl) (tabl) (tabl)
# 1 aaaaaa bbbaaa bbaaaa 3 1 1
# 2 aaabbb cccccc aaaaaa 2 2 3
# 3 aaaaaa aaabbb dddddd 3 2 2
# 4 cccccc dddddd eeeeee 2 2 1
df <- structure(list(yname1 = c("aaaaaa", "aaabbb", "aaaaaa", "cccccc"
), yname2 = c("bbbaaa", "cccccc", "aaabbb", "dddddd"), yname3 = c("bbaaaa",
"aaaaaa", "dddddd", "eeeeee")), .Names = c("yname1", "yname2",
"yname3"), row.names = c(NA, -4L), class = "data.frame")
答案 2 :(得分:5)
已经有一些好的解决方案,但没有人使用match()
来查找预先计算的频率表中的每个字符串。以下是如何做到这一点。请注意,我选择as.matrix()
为yname*
的参数和table()
的第一个参数生成match()
列的矩阵。
cns <- grep(value=T,'^yname',names(df));
m <- as.matrix(df[cns]);
cnts <- table(m);
df[,paste0('rcount',seq_along(cns))] <- matrix(cnts[match(m,names(cnts))],nrow(df));
df;
## yname1 yname2 yname3 rcount1 rcount2 rcount3
## 1 aaaaaa bbbaaa bbaaaa 3 1 1
## 2 aaabbb cccccc aaaaaa 2 2 3
## 3 aaaaaa aaabbb dddddd 3 2 2
## 4 cccccc dddddd eeeeee 2 2 1
更新:我无法相信我之前错过了这个,但表达方式
cnts[match(m,names(cnts))]
可以替换为
cnts[m]
因此根本不需要调用match()
。
我只是重新评估基准测试,发现它并没有以任何显着的方式改变我的解决方案的运行时间(可能只是在小规模测试中略微加速)。据推测,这是因为索引带有字符名称的向量需要在内部使用相同类型的match()
逻辑,因此上述替换不会获得任何性能。但我认为简洁和简洁的改进是值得的。
我应该注意到,我对其他一些解决方案进行了一些小的修改,以便产生这些基准测试结果。最值得注意的是,我想避免为重复执行复制任何输入,但由于data.tables通过引用传递,我不得不修改jota()
以使其成为幂等的。这涉及仅对目标yname*
列进行过滤,我通过cns
调用将其预先计算为名为grep()
的局部变量,就像我在自己的解决方案中一样。为了公平起见,我向所有解决方案添加了相同的grep()
调用和过滤逻辑,但markus()
除外,因为它显式处理,因此不需要它每列分开。我还将lookup
中的jota()
上的索引连接操作更改为lookup[.(value=x),,on='value']
,因为它对我来说不起作用。最后,对于mhairi()
,我通过在所有Reduce()
列中添加yname*
调用来完成解决方案。
library(microbenchmark);
library(data.table);
library(dplyr);
bgoldst <- function(df) { cns <- grep(value=T,'^yname',names(df)); m <- as.matrix(df[cns]); cnts <- table(m); df[,paste0('rcount',seq_along(cns))] <- matrix(cnts[match(m,names(cnts))],nrow(df)); df; };
markus <- function(df) { df$rcount1 <- sapply(df$yname1, function(x) sum(df$yname1==x)+sum(df$yname2==x)+sum(df$yname3==x)); df$rcount2 <- sapply(df$yname2, function(x) sum(df$yname1==x)+sum(df$yname2==x)+sum(df$yname3==x)); df$rcount3 <- sapply(df$yname3, function(x) sum(df$yname1==x)+sum(df$yname2==x)+sum(df$yname3==x)); df; };
alistaire <- function(df) { cns <- grep(value=T,'^yname',names(df)); df_table <- table(unlist(df[cns])); data.frame(df[cns],lapply(df[cns],function(x){data.frame(Freq=df_table[as.character(x)])})); };
jota <- function(dt) { cns <- grep(value=T,'^yname',names(df)); lookup <- melt(dt, measure.vars = cns)[, .N, by = value]; dt[, paste0("rcount", 1:3) := lapply(dt[,cns,with=F], function(x) lookup[.(value=x), , on = 'value'][, N])]; };
mhairi <- function(df) { cns <- grep(value=T,'^yname',names(df)); all_yname <-do.call(c,df[cns]); rcount <- as.data.frame(table(all_yname)); Reduce(function(df,cn) merge(df, rcount, by.x = cn, by.y = 'all_yname'),cns,df); };
coffein <- function(df) { cns <- grep(value=T,'^yname',names(df)); df2 <- melt(df[cns], id.vars = NULL); df2 <- df2 %>% group_by(value) %>% summarise(n=n()) %>% as.data.frame(); rownames(df2) <- df2$value; df2$value <- NULL; df$r1 <- df2[df$yname1,]; df$r2 <- df2[df$yname2,]; df$r3 <- df2[df$yname3,]; df; };
## OP's test case
df <- data.frame(yname1=c('aaaaaa','aaabbb','aaaaaa','cccccc'),yname2=c('bbbaaa','cccccc','aaabbb','dddddd'),yname3=c('bbaaaa','aaaaaa','dddddd','eeeeee'),stringsAsFactors=F);
dt <- as.data.table(df);
ex <- bgoldst(df);
identical(ex,markus(df));
## [1] TRUE
identical(ex,{ y <- alistaire(df); names(y) <- names(ex); rownames(y) <- NULL; cis <- seq_along(df)+ncol(df); y[cis] <- lapply(y[cis],as.integer); y; });
## [1] TRUE
identical(ex,as.data.frame(jota(dt)));
## [1] TRUE
identical(ex,{ y <- mhairi(df); y <- y[c(cns,names(y)[!names(y)%in%cns])]; names(y) <- names(ex); y <- y[do.call(order,Map(match,ex,y)),]; rownames(y) <- NULL; y; });
## [1] TRUE
identical(ex,{ y <- coffein(df); names(y) <- names(ex); y; });
## [1] TRUE
microbenchmark(bgoldst(df),markus(df),alistaire(df),jota(dt),mhairi(df),coffein(df));
## Unit: microseconds
## expr min lq mean median uq max neval
## bgoldst(df) 491.373 544.6165 597.4743 575.8350 609.192 2054.872 100
## markus(df) 375.907 435.5645 463.7258 467.4250 489.022 549.962 100
## alistaire(df) 754.380 816.1755 849.8749 840.3385 888.021 959.654 100
## jota(dt) 4143.955 4425.7785 4741.8354 4656.2835 4854.928 7347.930 100
## mhairi(df) 1938.122 2047.1740 2182.1841 2135.4850 2209.896 3969.045 100
## coffein(df) 1286.380 1430.9265 1546.3245 1511.3255 1562.430 3319.441 100
## scale test
set.seed(1L);
NR <- 4e3L; NC <- 3L; SL <- 6L;
df <- as.data.frame(setNames(nm=paste0('yname',seq_len(NC)),replicate(NC,do.call(paste0,replicate(SL,sample(letters,NR,T),simplify=F)),simplify=F)),stringsAsFactors=F);
dt <- as.data.table(df);
ex <- bgoldst(df);
identical(ex,markus(df));
## [1] TRUE
identical(ex,{ y <- alistaire(df); names(y) <- names(ex); rownames(y) <- NULL; cis <- seq_along(df)+ncol(df); y[cis] <- lapply(y[cis],as.integer); y; });
## [1] TRUE
identical(ex,as.data.frame(jota(dt)));
## [1] TRUE
identical(ex,{ y <- mhairi(df); y <- y[c(cns,names(y)[!names(y)%in%cns])]; names(y) <- names(ex); y <- y[do.call(order,Map(match,y,ex)),]; rownames(y) <- NULL; y; });
## [1] TRUE
identical(ex,{ y <- coffein(df); names(y) <- names(ex); y; });
## [1] TRUE
microbenchmark(bgoldst(df),markus(df),alistaire(df),jota(dt),mhairi(df),coffein(df),times=3L);
## Unit: milliseconds
## expr min lq mean median uq max neval
## bgoldst(df) 85.20766 87.00487 88.39154 88.80209 89.98348 91.16487 3
## markus(df) 3771.08606 3788.97413 3799.08405 3806.86220 3813.08305 3819.30390 3
## alistaire(df) 83.03348 83.10276 83.18116 83.17204 83.25500 83.33797 3
## jota(dt) 12.49174 13.82088 14.44939 15.15002 15.42821 15.70640 3
## mhairi(df) 156.06459 156.36608 158.27256 156.66758 159.37654 162.08551 3
## coffein(df) 154.02853 154.97215 156.52246 155.91576 157.76942 159.62309 3
答案 3 :(得分:3)
我更喜欢上面的答案,但为了完整性,让我添加一个替代方案,它基于使用唯一字符串作为rownames:
public partial class Form1 : Form
{
public DataTable dt;
public Form1()
{
InitializeComponent();
dt = new DataTable("TestTable");
dt.Columns.Add("Duration", typeof(TimeSpan));
DataRow dr = dt.NewRow();
dr.ItemArray = new object[] { new TimeSpan(1, 1, 1, 1) };
dt.Rows.Add(dr);
dataGridView1.DataSource = dt;
this.dataGridView1.Columns["Duration"].DefaultCellStyle.Format = "l";
this.dataGridView1.Columns["Duration"].DefaultCellStyle.FormatProvider = new TimeSpanFormatter();
this.dataGridView1.DataError += DataGridView1_DataError;
this.dataGridView1.CellFormatting += DataGridView1_CellFormatting;
}
void DataGridView1_CellFormatting(object sender, DataGridViewCellFormattingEventArgs e)
{
if (e.CellStyle.FormatProvider is ICustomFormatter)
{
e.Value = (e.CellStyle.FormatProvider.GetFormat(typeof(ICustomFormatter)) as ICustomFormatter).Format(e.CellStyle.Format, e.Value, e.CellStyle.FormatProvider);
e.FormattingApplied = true;
}
}
private void DataGridView1_DataError(object sender, DataGridViewDataErrorEventArgs e)
{
throw new NotImplementedException();
}
public class TimeSpanFormatter : IFormatProvider, ICustomFormatter
{
public object GetFormat(Type formatType)
{
if (formatType == typeof(ICustomFormatter))
return this;
else
return null;
}
public string Format(string fmt, object arg, IFormatProvider formatProvider)
{
if (arg == null) return string.Empty;
if (arg.GetType() != typeof(TimeSpan))
try
{
return HandleOtherFormats(fmt, arg);
}
catch (FormatException e)
{
throw new FormatException(String.Format("The format of '{0}' is invalid.", fmt), e);
}
string tResult = string.Empty;
try
{
TimeSpan ts = (TimeSpan)arg;
tResult = string.Format("{0:N0}:{1}", ts.TotalHours,ts.Minutes);
}catch (Exception ex)
{
throw;
}
return tResult;
}
private string HandleOtherFormats(string format, object arg)
{
if (arg is IFormattable)
return ((IFormattable)arg).ToString(format, System.Globalization.CultureInfo.CurrentCulture);
else if (arg != null)
return arg.ToString();
else
return String.Empty;
}
}
}
现在我们有一个数据框,其中包含唯一字符向量的出现次数,字符向量是rownames。我们可以使用这些来对所述数据帧进行子集化。
df2 <- melt(df, id.vars = NULL)
df2 <- df2 %>% group_by(value) %>% summarise(n=n()) %>% as.data.frame()
rownames(df2) <- df2$value
df2$value <- NULL
修改强>
看看其他答案,并且考虑到我们开始讨论性能,我意识到上述情况不太复杂,可以改进如下:
# df[] <- lapply(df, as.character) # in case they are stored as factors
df$r1 <- df2[df$yname1,]
df$r2 <- df2[df$yname2,]
df$r3 <- df2[df$yname3,]
> df
yname1 yname2 yname3 r1 r2 r3
1 aaaaaa bbbaaa bbaaaa 3 1 1
2 aaabbb cccccc aaaaaa 2 2 3
3 aaaaaa aaabbb dddddd 3 2 2
4 cccccc dddddd eeeeee 2 2 1
这样可以完全避免调用df2 <- data.frame(table(unlist(df)), row.names = 1)
df$r1 <- df2[df$yname1,]
df$r2 <- df2[df$yname2,]
df$r3 <- df2[df$yname3,]
和reshape2
,并相应地提高性能。使用
dplyr
这个解决方案现在要快得多,但速度不如某些替代方案快。参见
espresso <- function(df) {
cns <- grep(value=T,'^yname',names(df));
df2 <- data.frame(table(unlist(df[cns])), row.names = 1)
df$r1 <- df2[df$yname1,]; df$r2 <- df2[df$yname2,]; df$r3 <- df2[df$yname3,]; df
};
答案 4 :(得分:2)
我认为找到每个唯一值的总和然后加入原始表会更快。
all_yname <-c(df$yname1, df$yname2, df$yname3)
rcount <- as.data.frame(table(all_yname))
merge(df, rcount, by.x = 'yname1', by.y = 'all_yname')
并为每一行重复合并。