Question

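I have seen several answers about joining with different packages. I need to join several data frames, and that is a very expensive operation for my computer with the basic "merge" algorithm.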

My data:

library(data.table)

list.of.data.frames = list( data.table("P1" = c(1:3,1:3), "P2" = c(rep(2.5,3),rep(1.5,3)), "D1" = c(3.5,4.5,5.5,2.5,3.5,4.5)),
                        data.table("P1" = c(1:3,1:3), "P3" = c(rep(2,3),rep(3,3)), "D3" =c(3:5,4:6)),
                        data.table("P1" = c(2:4), "P4" = c(2:4))
                        )

I have tried these two approaches:

Using reshape

library(reshape)
merge_recurse(list.of.data.frames)

Using base R

Reduce(function(...) merge(..., all=T), list.of.data.frames)

Output:

    P1  P2  D1 P3 D3 P4
 1:  1 2.5 3.5  2  3 NA
 2:  1 2.5 3.5  3  4 NA
 3:  2 2.5 4.5  2  4  2
 4:  2 2.5 4.5  3  5  2
 5:  3 2.5 5.5  2  5  3
 6:  3 2.5 5.5  3  6  3
 7:  1 1.5 2.5  2  3 NA
 8:  1 1.5 2.5  3  4 NA
 9:  2 1.5 3.5  2  4  2
10:  2 1.5 3.5  3  5  2
11:  3 1.5 4.5  2  5  3
12:  3 1.5 4.5  3  6  3
13:  4  NA  NA NA NA  4

I am trying to find the fastest way to do this, one that scales easily to a list of tables. I ran into trouble with keys, because each data.frame (or data.table) can have different columns, some of which may or may not overlap with the other tables.

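I also saw a data.table function mentioned, but I don't know whether it comes from an older version, because I can't find it in my console.

Any idea how to proceed?

Thanks in advance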

Answer 1

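Worth noting: neither of the approaches you included ran for me as posted, and I could not coax the reshape approach into running.

As @David mentioned in the comments, you are already using the merge.data.table method when you call merge from base, because merge is a generic that "passes off" to more specific methods (in this case, the one for data.table).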
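As a small illustration of that dispatch, here is a sketch on two made-up tables (a and b are hypothetical and not part of the question's data). Calling merge on data.table inputs returns a data.table, and the method behind it can be looked up with getS3method:

```r
library(data.table)

# Two small hypothetical tables sharing the column P1
a <- data.table(P1 = 1:3, P2 = c(2.5, 2.5, 2.5))
b <- data.table(P1 = 2:4, P4 = 2:4)

# merge() is an S3 generic; with data.table inputs it dispatches
# to the data.table method rather than merge.data.frame
merged <- merge(a, b, by = "P1", all = TRUE)

# Look up the method registered for class "data.table"
method <- getS3method("merge", "data.table")

is.data.table(merged)
is.function(method)
```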

A related answer has a version that uses dplyr's left_join for multiple merges, which can be adapted here to use full_join:

Reduce(function(dtf1,dtf2) full_join(dtf1,dtf2), list.of.data.frames)

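We can test the approaches explicitly with the microbenchmark package. I am adding a version in which I tell full_join which column to join on instead of letting it work that out (although this will not work if each join needs a different set of columns to match on). I am also including @Axeman's suggestion to use reduce from purrr instead of Reduce.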

library(data.table)
library(dplyr)
library(purrr)
library(microbenchmark)

microbenchmark(
  base = Reduce(function(...) merge(..., all=T, by = "P1"), list.of.data.frames)
  , dplyr = Reduce(function(dtf1,dtf2) full_join(dtf1,dtf2), list.of.data.frames)
  , dplyrSet = Reduce(function(dtf1,dtf2) full_join(dtf1,dtf2, by = "P1"), list.of.data.frames)
  , dplyrPurrr = reduce(list.of.data.frames, full_join, by = "P1")
)

which gives:

Unit: microseconds
       expr      min        lq      mean    median       uq      max neval cld
       base 2911.495 3025.2325 3227.3762 3077.8530 3211.995 5513.166   100   c
      dplyr  946.367 1022.0960 1087.8771 1066.3615 1131.675 1429.581   100  b 
   dplyrSet  443.828  485.3235  543.7130  511.1545  553.040 1918.009   100 a  
 dplyrPurrr  465.329  494.6615  548.7349  515.6695  551.943 1804.394   100 a

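So full_join is roughly 3 times faster than merge, and naming the join variable cuts the time roughly in half again. reduce does not reduce the time, though it does make for cleaner code.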
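For the case where each join needs a different set of key columns, a fixed by = "P1" will not do. One option, sketched here with base merge on three hypothetical data.frames, is to compute the shared column names at each step. The helper merge_shared is my own name for illustration; it only makes explicit what merge already does by default for data.frames when by is omitted, and adds a check that a shared column exists.

```r
# Hypothetical helper: join on whatever columns the two tables share
merge_shared <- function(x, y) {
  keys <- intersect(names(x), names(y))
  # Fail loudly instead of silently producing a cross join
  stopifnot(length(keys) > 0)
  merge(x, y, by = keys, all = TRUE)
}

# Made-up tables: the first two share P1, the third shares only P4
dfs <- list(
  data.frame(P1 = 1:3, P2 = c(2.5, 2.5, 2.5)),
  data.frame(P1 = 2:4, P4 = 2:4),
  data.frame(P4 = 2:4, Q = c("a", "b", "c"))
)

# Fold the list left to right, re-deriving the keys for every pair
result <- Reduce(merge_shared, dfs)
result
```

The order of the list matters with this approach, since each table has to share at least one column with the accumulated result so far.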

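We can (and, as @Frank points out, should) confirm that the returned values are the same. There is some debate about what "the same" should mean for results like these, so I am using compare from the compare package to check for differences (every full_join approach gives identical results, so I only show the interesting comparison):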

compare(
  Reduce(function(...) merge(..., all=T, by = "P1"), list.of.data.frames)
  , reduce(list.of.data.frames, full_join, by = "P1")
  , allowAll = TRUE
  )

which returns:

TRUE
  sorted
  renamed rows
  dropped row names
  dropped attributes

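So the values are the same, but they come back in a different order (they need sorting), with different row names (they need renaming or dropping), and with different attributes (they need dropping). If any of these matter for the use case, the user will need to determine which approach gives the ordering, row names, and attributes they want.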
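If the compare package is not available, a similar check can be approximated in base R by normalising both results before comparing them. This is only a sketch on two made-up data.frames, and normalise is a hypothetical helper of mine, not something from compare: it puts the columns in a fixed order, sorts the rows, and drops the row names.

```r
# Same values, different row order and row names (made-up example data)
a <- data.frame(P1 = c(1, 2, 3), P4 = c(NA, 2, 3))
b <- data.frame(P1 = c(3, 1, 2), P4 = c(3, NA, 2))
rownames(b) <- c("x", "y", "z")

# Hypothetical helper: fixed column order, sorted rows, default row names
normalise <- function(df) {
  df <- df[, sort(names(df)), drop = FALSE]
  df <- df[do.call(order, unname(as.list(df))), , drop = FALSE]
  rownames(df) <- NULL
  df
}

raw_equal  <- isTRUE(all.equal(a, b))
norm_equal <- isTRUE(all.equal(normalise(a), normalise(b)))

raw_equal
norm_equal
```

Here the raw comparison fails while the normalised one succeeds, which is the base R analogue of compare reporting TRUE only after sorting and dropping row names.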

As @DavidArenburg pointed out, different sizes may lead to different results. So here is some code to check a range of sizes.

medianTimes_dataTable <- lapply(10^(1:5), function(n){
  list_of_longer_ones = list( data.table("P1" = c(1:n), "P2" = rnorm(n), "D1" = rnorm(n)),
                              data.table("P1" = sample(1:n), "P3" = rnorm(n), "D3" =rnorm(n)),
                              data.table("P1" = sample(1:n), "P4" = rnorm(n))
  )


  microbenchmark(
    base = Reduce(function(...) merge(..., all=T, by = "P1"), list_of_longer_ones)
    , dplyr = Reduce(function(dtf1,dtf2) full_join(dtf1,dtf2), list_of_longer_ones)
    , dplyrSet = Reduce(function(dtf1,dtf2) full_join(dtf1,dtf2, by = "P1"), list_of_longer_ones)
    , dplyrPurrr = reduce(list_of_longer_ones, full_join, by = "P1")
  ) %>%
    group_by(expr) %>%
    summarise(median = median(time)) %>%
    mutate(nRows = n)
}) %>%
  bind_rows

medianTimes_dataTable %>%
  mutate_at(c("median", "nRows"), format, big.mark = ",", scientific = FALSE) %>%
  spread(nRows, median)

which gives (median times in nanoseconds, one column per number of rows):

        expr     `     10`     `    100`     `  1,000`     ` 10,000`     `100,000`
*     <fctr>         <chr>         <chr>         <chr>         <chr>         <chr>
1       base   2,032,614.5   2,059,519.0   2,716,534.0   4,475,653.5  29,655,330.0
2      dplyr   1,147,676.5   1,205,818.0   2,369,464.5  11,170,513.5 154,767,265.5
3   dplyrSet     537,434.0     613,785.5   1,602,681.0  10,215,099.5 145,574,663.0
4 dplyrPurrr     540,455.5     626,076.5   1,549,114.0  10,040,808.5 145,086,376.0

So the dplyr advantage slips away somewhere between 1,000 and 10,000 rows.

@David also asked about the effect of data.table versus data.frame, so I ran the same code on data.frames:

medianTimes_dataFrame <- lapply(10^(1:5), function(n){
  list_of_longer_ones = list( data.frame("P1" = c(1:n), "P2" = rnorm(n), "D1" = rnorm(n)),
                              data.frame("P1" = sample(1:n), "P3" = rnorm(n), "D3" =rnorm(n)),
                              data.frame("P1" = sample(1:n), "P4" = rnorm(n))
  )


  microbenchmark(
    base = Reduce(function(...) merge(..., all=T, by = "P1"), list_of_longer_ones)
    , dplyr = Reduce(function(dtf1,dtf2) full_join(dtf1,dtf2), list_of_longer_ones)
    , dplyrSet = Reduce(function(dtf1,dtf2) full_join(dtf1,dtf2, by = "P1"), list_of_longer_ones)
    , dplyrPurrr = reduce(list_of_longer_ones, full_join, by = "P1")
  ) %>%
    group_by(expr) %>%
    summarise(median = median(time)) %>%
    mutate(nRows = n)
}) %>%
  bind_rows

medianTimes_dataFrame %>%
  mutate_at(c("median", "nRows"), format, big.mark = ",", scientific = FALSE) %>%
  spread(nRows, median)

which gives (median times in nanoseconds, one column per number of rows):

        expr     `     10`     `    100`     `  1,000`     ` 10,000`     `100,000`
*     <fctr>         <chr>         <chr>         <chr>         <chr>         <chr>
1       base     806,009.5     973,636.0   2,046,009.5  19,088,482.5 519,159,607.0
2      dplyr   1,092,747.0   1,242,550.5   2,010,648.5  10,618,735.5 156,958,793.0
3   dplyrSet     526,030.0     616,996.0   1,343,766.5   9,767,689.5 147,919,013.5
4 dplyrPurrr     541,182.0     624,208.0   1,351,910.0   9,711,435.0 146,379,176.5

Here, full_join continues to outperform merge as the tables grow, which suggests that the merge.data.table method is better (by a lot) than the merge.data.frame method.
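One practical reading of that result: if the inputs arrive as plain data.frames and are large, it may be worth converting them first so that merge dispatches to the data.table method. The following is a minimal sketch on made-up tables, not a benchmarked recommendation; whether it pays off will depend on the data.

```r
library(data.table)

# Made-up data.frames standing in for the real inputs
dfs <- list(
  data.frame(P1 = 1:3, P2 = c(2.5, 2.5, 2.5)),
  data.frame(P1 = 2:4, P4 = 2:4)
)

# Convert each element so that merge() uses the data.table method
dts <- lapply(dfs, as.data.table)

# Same Reduce pattern as in the benchmarks, now on data.tables
result <- Reduce(function(x, y) merge(x, y, by = "P1", all = TRUE), dts)
result
```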
