有效地计算跨多个列的字符串的出现次数

时间:2016-05-27 19:54:52

标签: r dataframe

我有一个大型数据框(> 400万行),其中包含存储字符串的yname2yname3yname1 | yname2 | yname3 aaaaaa | bbbaaa | bbaaaa aaabbb | cccccc | aaaaaa aaaaaa | aaabbb | dddddd cccccc | dddddd | eeeeee 列:

yname1 | yname2 | yname3 | rcount1 | rcount2 | rcount3
aaaaaa | bbbaaa | bbaaaa |       3 |       1 |       1
aaabbb | cccccc | aaaaaa |       2 |       2 |       3
aaaaaa | aaabbb | dddddd |       3 |       2 |       2
cccccc | dddddd | eeeeee |       2 |       2 |       1

现在我想计算所有列中每个字符串的总出现次数。这些应作为附加列添加:

data3$rcount1 <- sapply(data3$yname1, function(x) sum(data2$yname1==x)+sum(data2$yname2==x)+sum(data2$yname3==x))
data3$rcount2 <- sapply(data3$yname2, function(x) sum(data2$yname1==x)+sum(data2$yname2==x)+sum(data2$yname3==x))
data3$rcount3 <- sapply(data3$yname3, function(x) sum(data2$yname1==x)+sum(data2$yname2==x)+sum(data2$yname3==x))

我已经编写了以下代码,完成了这项工作:

@Select(value= "{ CALL getTotalCityStateId()}")
@Options(statementType = StatementType.CALLABLE)
@ResultType(State.class)
@Results(
{   
 @Result(property="id", column="state_id"),
 @Result(property="name", column="state_name"),
 @Result(property="code", column="state_code"),
})
List<State> callGetStatesAnnotations();

然而,这非常缓慢,需要数天才能计算出来。我有什么想法可以加快速度吗?

5 个答案:

答案 0 :(得分:6)

data.table方法怎么样:

library(data.table)
setDT(d)

lookup <- melt(d, measure.vars = paste0("yname", 1:3))[, .N, by = value]
#    value N
#1: aaaaaa 3
#2: aaabbb 2
#3: cccccc 2
#4: bbbaaa 1
#5: dddddd 2
#6: bbaaaa 1
#7: eeeeee 1

d[, paste0("rcount", 1:3) :=
   lapply(d, function(x) lookup[x, , on = .(value)][, N])]

#   yname1 yname2 yname3 rcount1 rcount2 rcount3
#1: aaaaaa bbbaaa bbaaaa       3       1       1
#2: aaabbb cccccc aaaaaa       2       2       3
#3: aaaaaa aaabbb dddddd       3       2       2
#4: cccccc dddddd eeeeee       2       2       1

Microbenchmark输出复制来自bgoldst的例子,但有400,000行。

Unit: seconds
          expr       min        lq      mean    median        uq       max neval
   bgoldst(df) 21.445961 21.628228 21.876051 21.810496 22.091096 22.371697     3  
 alistaire(df) 20.685357 20.961761 21.255457 21.238164 21.540507 21.842850     3  
      jota(dt)  2.629337  2.692613  2.719207  2.755889  2.764141  2.772394     3   
    mhairi(df) 40.780441 41.048345 41.669798 41.316249 42.114476 42.912702     3
   coffein(df) 35.669630 35.678719 36.453257 35.687808 36.845071 38.002334     3 
  espresso(df) 20.823840 20.976175 21.317218 21.128509 21.563907 21.999306     3

答案 1 :(得分:6)

在基础R中,您可以构建一个包含data.frame的未列出值的表,并按值对其进行索引。确保你索引的是一个字符串,而不是一个因子(因此是as.character),或者它将被数字而不是名称索引。

data.frame(df, 
           lapply(df, function(x){data.frame(table(unlist(df))[as.character(x)])['Freq']})
)
#   yname1 yname2 yname3 Freq Freq.1 Freq.2
# 1 aaaaaa bbbaaa bbaaaa    3      1      1
# 2 aaabbb cccccc aaaaaa    2      2      3
# 3 aaaaaa aaabbb dddddd    3      2      2
# 4 cccccc dddddd eeeeee    2      2      1

如果data.frame足够大而且速度很慢,那么您可以在lapply之外构建表,因此它只运行一次:

df_table <- table(unlist(df))
data.frame(df, lapply(df, function(x){data.frame(df_table[as.character(x)])['Freq']}))

你也可以把它放在dplyr中,这使它更具可读性:

       # look up times repeated
df %>% mutate_each(funs(table(unlist(df))[as.character(.)])) %>%   # or mutate_each(funs(df_table[as.character(.)]))
    # fix column names
    select(rcount = starts_with('yname')) %>% 
    # add original df back in
    bind_cols(df, .)
# Source: local data frame [4 x 6]
# 
#   yname1 yname2 yname3 rcount1 rcount2 rcount3
#   (fctr) (fctr) (fctr)  (tabl)  (tabl)  (tabl)
# 1 aaaaaa bbbaaa bbaaaa       3       1       1
# 2 aaabbb cccccc aaaaaa       2       2       3
# 3 aaaaaa aaabbb dddddd       3       2       2
# 4 cccccc dddddd eeeeee       2       2       1

数据

df <- structure(list(yname1 = c("aaaaaa", "aaabbb", "aaaaaa", "cccccc"
    ), yname2 = c("bbbaaa", "cccccc", "aaabbb", "dddddd"), yname3 = c("bbaaaa", 
    "aaaaaa", "dddddd", "eeeeee")), .Names = c("yname1", "yname2", 
    "yname3"), row.names = c(NA, -4L), class = "data.frame")

答案 2 :(得分:5)

已经有一些好的解决方案,但没有人使用match()来查找预先计算的频率表中的每个字符串。以下是如何做到这一点。请注意,我选择as.matrix()yname*的参数和table()的第一个参数生成match()列的矩阵。

cns <- grep(value=T,'^yname',names(df));
m <- as.matrix(df[cns]);
cnts <- table(m);
df[,paste0('rcount',seq_along(cns))] <- matrix(cnts[match(m,names(cnts))],nrow(df));
df;
##   yname1 yname2 yname3 rcount1 rcount2 rcount3
## 1 aaaaaa bbbaaa bbaaaa       3       1       1
## 2 aaabbb cccccc aaaaaa       2       2       3
## 3 aaaaaa aaabbb dddddd       3       2       2
## 4 cccccc dddddd eeeeee       2       2       1

更新:我无法相信我之前错过了这个,但表达方式

cnts[match(m,names(cnts))]

可以替换为

cnts[m]

因此根本不需要调用match()

我只是重新评估基准测试,发现它并没有以任何显着的方式改变我的解决方案的运行时间(可能只是在小规模测试中略微加速)。据推测,这是因为索引带有字符名称的向量需要在内部使用相同类型的match()逻辑,因此上述替换不会获得任何性能。但我认为简洁和简洁的改进是值得的。

基准

我应该注意到,我对其他一些解决方案进行了一些小的修改,以便产生这些基准测试结果。最值得注意的是,我想避免为重复执行复制任何输入,但由于data.tables通过引用传递,我不得不修改jota()以使其成为幂等的。这涉及仅对目标yname*列进行过滤,我通过cns调用将其预先计算为名为grep()的局部变量,就像我在自己的解决方案中一样。为了公平起见,我向所有解决方案添加了相同的grep()调用和过滤逻辑,但markus()除外,因为它显式处理,因此不需要它每列分开。我还将lookup中的jota()上的索引连接操作更改为lookup[.(value=x),,on='value'],因为它对我来说不起作用。最后,对于mhairi(),我通过在所有Reduce()列中添加yname*调用来完成解决方案。

library(microbenchmark);
library(data.table);
library(dplyr);

bgoldst <- function(df) { cns <- grep(value=T,'^yname',names(df)); m <- as.matrix(df[cns]); cnts <- table(m); df[,paste0('rcount',seq_along(cns))] <- matrix(cnts[match(m,names(cnts))],nrow(df)); df; };
markus <- function(df) { df$rcount1 <- sapply(df$yname1, function(x) sum(df$yname1==x)+sum(df$yname2==x)+sum(df$yname3==x)); df$rcount2 <- sapply(df$yname2, function(x) sum(df$yname1==x)+sum(df$yname2==x)+sum(df$yname3==x)); df$rcount3 <- sapply(df$yname3, function(x) sum(df$yname1==x)+sum(df$yname2==x)+sum(df$yname3==x)); df; };
alistaire <- function(df) { cns <- grep(value=T,'^yname',names(df)); df_table <- table(unlist(df[cns])); data.frame(df[cns],lapply(df[cns],function(x){data.frame(Freq=df_table[as.character(x)])})); };
jota <- function(dt) { cns <- grep(value=T,'^yname',names(df)); lookup <- melt(dt, measure.vars = cns)[, .N, by = value]; dt[, paste0("rcount", 1:3) := lapply(dt[,cns,with=F], function(x) lookup[.(value=x), , on = 'value'][, N])]; };
mhairi <- function(df) { cns <- grep(value=T,'^yname',names(df)); all_yname <-do.call(c,df[cns]); rcount <- as.data.frame(table(all_yname)); Reduce(function(df,cn) merge(df, rcount, by.x = cn, by.y = 'all_yname'),cns,df); };
coffein <- function(df) { cns <- grep(value=T,'^yname',names(df)); df2 <- melt(df[cns], id.vars = NULL); df2 <- df2 %>% group_by(value) %>% summarise(n=n()) %>% as.data.frame(); rownames(df2) <- df2$value; df2$value <- NULL; df$r1 <- df2[df$yname1,]; df$r2 <- df2[df$yname2,]; df$r3 <- df2[df$yname3,]; df; };
## OP's test case
df <- data.frame(yname1=c('aaaaaa','aaabbb','aaaaaa','cccccc'),yname2=c('bbbaaa','cccccc','aaabbb','dddddd'),yname3=c('bbaaaa','aaaaaa','dddddd','eeeeee'),stringsAsFactors=F);
dt <- as.data.table(df);

ex <- bgoldst(df);
identical(ex,markus(df));
## [1] TRUE
identical(ex,{ y <- alistaire(df); names(y) <- names(ex); rownames(y) <- NULL; cis <- seq_along(df)+ncol(df); y[cis] <- lapply(y[cis],as.integer); y; });
## [1] TRUE
identical(ex,as.data.frame(jota(dt)));
## [1] TRUE
identical(ex,{ y <- mhairi(df); y <- y[c(cns,names(y)[!names(y)%in%cns])]; names(y) <- names(ex); y <- y[do.call(order,Map(match,ex,y)),]; rownames(y) <- NULL; y; });
## [1] TRUE
identical(ex,{ y <- coffein(df); names(y) <- names(ex); y; });
## [1] TRUE

microbenchmark(bgoldst(df),markus(df),alistaire(df),jota(dt),mhairi(df),coffein(df));
## Unit: microseconds
##           expr      min        lq      mean    median       uq      max neval
##    bgoldst(df)  491.373  544.6165  597.4743  575.8350  609.192 2054.872   100
##     markus(df)  375.907  435.5645  463.7258  467.4250  489.022  549.962   100
##  alistaire(df)  754.380  816.1755  849.8749  840.3385  888.021  959.654   100
##       jota(dt) 4143.955 4425.7785 4741.8354 4656.2835 4854.928 7347.930   100
##     mhairi(df) 1938.122 2047.1740 2182.1841 2135.4850 2209.896 3969.045   100
##    coffein(df) 1286.380 1430.9265 1546.3245 1511.3255 1562.430 3319.441   100
## scale test
set.seed(1L);
NR <- 4e3L; NC <- 3L; SL <- 6L;
df <- as.data.frame(setNames(nm=paste0('yname',seq_len(NC)),replicate(NC,do.call(paste0,replicate(SL,sample(letters,NR,T),simplify=F)),simplify=F)),stringsAsFactors=F);
dt <- as.data.table(df);

ex <- bgoldst(df);
identical(ex,markus(df));
## [1] TRUE
identical(ex,{ y <- alistaire(df); names(y) <- names(ex); rownames(y) <- NULL; cis <- seq_along(df)+ncol(df); y[cis] <- lapply(y[cis],as.integer); y; });
## [1] TRUE
identical(ex,as.data.frame(jota(dt)));
## [1] TRUE
identical(ex,{ y <- mhairi(df); y <- y[c(cns,names(y)[!names(y)%in%cns])]; names(y) <- names(ex); y <- y[do.call(order,Map(match,y,ex)),]; rownames(y) <- NULL; y; });
## [1] TRUE
identical(ex,{ y <- coffein(df); names(y) <- names(ex); y; });
## [1] TRUE

microbenchmark(bgoldst(df),markus(df),alistaire(df),jota(dt),mhairi(df),coffein(df),times=3L);
## Unit: milliseconds
##           expr        min         lq       mean     median         uq        max neval
##    bgoldst(df)   85.20766   87.00487   88.39154   88.80209   89.98348   91.16487     3
##     markus(df) 3771.08606 3788.97413 3799.08405 3806.86220 3813.08305 3819.30390     3
##  alistaire(df)   83.03348   83.10276   83.18116   83.17204   83.25500   83.33797     3
##       jota(dt)   12.49174   13.82088   14.44939   15.15002   15.42821   15.70640     3
##     mhairi(df)  156.06459  156.36608  158.27256  156.66758  159.37654  162.08551     3
##    coffein(df)  154.02853  154.97215  156.52246  155.91576  157.76942  159.62309     3

答案 3 :(得分:3)

我更喜欢上面的答案,但为了完整性,让我添加一个替代方案,它基于使用唯一字符串作为rownames:

 public partial class Form1 : Form
{
    public DataTable dt;
    public Form1()
    {
        InitializeComponent();
        dt = new DataTable("TestTable");
        dt.Columns.Add("Duration", typeof(TimeSpan));
        DataRow dr = dt.NewRow();
        dr.ItemArray = new object[] { new TimeSpan(1, 1, 1, 1) };
        dt.Rows.Add(dr);
        dataGridView1.DataSource = dt;
        this.dataGridView1.Columns["Duration"].DefaultCellStyle.Format = "l";

        this.dataGridView1.Columns["Duration"].DefaultCellStyle.FormatProvider = new TimeSpanFormatter();

        this.dataGridView1.DataError += DataGridView1_DataError;
        this.dataGridView1.CellFormatting += DataGridView1_CellFormatting;

    }



    void DataGridView1_CellFormatting(object sender, DataGridViewCellFormattingEventArgs e)
    {

        if (e.CellStyle.FormatProvider is ICustomFormatter)
        {
            e.Value = (e.CellStyle.FormatProvider.GetFormat(typeof(ICustomFormatter)) as ICustomFormatter).Format(e.CellStyle.Format, e.Value, e.CellStyle.FormatProvider);
            e.FormattingApplied = true;
        }
    }
    private void DataGridView1_DataError(object sender, DataGridViewDataErrorEventArgs e)
    {
        throw new NotImplementedException();
    }

    public class TimeSpanFormatter : IFormatProvider, ICustomFormatter
    {

        public object GetFormat(Type formatType)
        {
            if (formatType == typeof(ICustomFormatter))
                return this;
            else
                return null;
        }

        public string Format(string fmt, object arg, IFormatProvider formatProvider)
        {
            if (arg == null) return string.Empty;

            if (arg.GetType() != typeof(TimeSpan))
                try
                {
                    return HandleOtherFormats(fmt, arg);
                }
                catch (FormatException e)
                {
                    throw new FormatException(String.Format("The format of '{0}' is invalid.", fmt), e);
                }

            string tResult = string.Empty;
            try
            {
                TimeSpan ts = (TimeSpan)arg;
                tResult = string.Format("{0:N0}:{1}", ts.TotalHours,ts.Minutes);
            }catch (Exception ex)
            {
                throw;
            }
            return tResult;
        }

        private string HandleOtherFormats(string format, object arg)
        {
            if (arg is IFormattable)
                return ((IFormattable)arg).ToString(format, System.Globalization.CultureInfo.CurrentCulture);
            else if (arg != null)
                return arg.ToString();
            else
                return String.Empty;
        }
    }
}

现在我们有一个数据框,其中包含唯一字符向量的出现次数,字符向量是rownames。我们可以使用这些来对所述数据帧进行子集化。

df2 <- melt(df, id.vars = NULL)
df2 <- df2 %>% group_by(value) %>% summarise(n=n()) %>% as.data.frame()
rownames(df2) <- df2$value
df2$value <- NULL

修改

看看其他答案,并且考虑到我们开始讨论性能,我意识到上述情况不太复杂,可以改进如下:

# df[] <- lapply(df, as.character) # in case they are stored as factors
df$r1 <- df2[df$yname1,]
df$r2 <- df2[df$yname2,]
df$r3 <- df2[df$yname3,]

> df
  yname1 yname2 yname3 r1 r2 r3
1 aaaaaa bbbaaa bbaaaa  3  1  1
2 aaabbb cccccc aaaaaa  2  2  3
3 aaaaaa aaabbb dddddd  3  2  2
4 cccccc dddddd eeeeee  2  2  1

这样可以完全避免调用df2 <- data.frame(table(unlist(df)), row.names = 1) df$r1 <- df2[df$yname1,] df$r2 <- df2[df$yname2,] df$r3 <- df2[df$yname3,] reshape2,并相应地提高性能。使用

dplyr

这个解决方案现在要快得多,但速度不如某些替代方案快。参见

espresso <- function(df) { 
  cns <- grep(value=T,'^yname',names(df)); 
  df2 <- data.frame(table(unlist(df[cns])), row.names = 1)
  df$r1 <- df2[df$yname1,]; df$r2 <- df2[df$yname2,]; df$r3 <- df2[df$yname3,]; df
};

答案 4 :(得分:2)

我认为找到每个唯一值的总和然后加入原始表会更快。

all_yname <-c(df$yname1, df$yname2, df$yname3)
rcount <- as.data.frame(table(all_yname))

merge(df, rcount, by.x = 'yname1', by.y = 'all_yname')

并为每一行重复合并。