Question

I have one data frame of length 5923 and a second data frame of length 68709. Samples of both are shown below.

Their common column is the first column, "people_id".

This is what I have so far:

#
# This R function merges two tables and returns the resulting table in a new data frame.
# Inputs:
# 1. tbl1 is loaded from a csv file.
# 2. tbl2 is the output of a query containing people_id and repository_id.
# There can be multiple repository ids associated with each people id.
#
mergetbl <- function(tbl1, tbl2)
{
  # tbl1 -- from csv file
  # tbl2 -- from sql query
  # 1. create an empty data frame
  # 2. go through tbl1 row by row
  # 3. for each row in tbl1, look at the current people_id in tbl2 and extract all associated repository_id
  # 4. duplicate the same row in tbl1 the same number of times there are associated repository ids
  # 5. merge duplicate rows with the column repository ids
  # 6. merge duplicate rows into new data frame
  # 7. repeat from 2. until last row in tbl1
  newtbl = data.frame(people_id=numeric(),repoCommitted=numeric(),isAuthor=numeric(),repoAuthor=numeric(),commonRepo=numeric())

  ntbl1rows<-nrow(tbl1)
  tbl2patched<-tbl2[complete.cases(tbl2),]
  for(n in 1:ntbl1rows)
  {
    ndup<-nrow(tbl2patched[tbl2patched$people_id==tbl1$people[n],])
    duprow<- tbl1[rep(n,ndup),]
    newtbl<-rbind(newtbl,duprow)


  }
}

I'm stuck on step 5, which patches "repository_id" from tbl2 into newtbl wherever the ids match. The first data frame looks like this:

    people  committers  repositoryCommitter  authors  repositoryAuthor
 1  1       921         183                  896      178
 2  2       240         18                   209      22
 3  3       3           2                    28       11
 4  4       6548        23                   6272     29
 5  5       3557        146                  3453     146

And so on, up to row 5923.

The second data frame:

    people_id repository_id
    1           1
    1           2
    1           6
    1           7
    1           10

And so on, up to row 68709.

For this sample, the output should look like this:

    people_id committers   repoCommitter authors   repoAuthors  commonRepo
1    1        921          183            896       178           1
2    1        921          183            896       178           2
3    1        921          183            896       178           6
4    1        921          183            896       178           7
5    1        921          183            896       178           10

Answer 1

I loaded your data, recognizing that the first column in each CSV file evidently contains row names:

people <- read.csv('people.csv',row.names=1);
peoplePerRepo <- read.csv('peoplePerRepo.csv',row.names=1);

A sample of the resulting data.frames:

head(people);
##   people committers repositoryCommitter authors repositoryAuthor
## 1      1        921                 183     896              178
## 2      2        240                  18     209               22
## 3      3          3                   2      28               11
## 4      4       6548                  23    6272               29
## 5      5       3557                 146    3453              146
## 6      6        445                  55     444               55
head(peoplePerRepo);
##   people_id repository_id
## 1         1             1
## 2         1             2
## 3         1             6
## 4         1             7
## 5         1            10
## 6         1            11

One detail we should note is that the key column names are inconsistent: people$people vs. peoplePerRepo$people_id. As we will see, this is nothing we can't handle.

I investigated the data for my own benefit, but I'll include some of the results here to make sure we're on the same page. First, the row counts:

nrow(people);
## [1] 5923
nrow(peoplePerRepo);
## [1] 72179

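So your claim that the first data.frame has length 5923 is confirmed, but the second one is longer than the 68709 you claimed: it has 72179 rows. I checked whether there were duplicate rows, but there don't appear to be any: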

nrow(unique(people));
## [1] 5923
nrow(unique(peoplePerRepo));
## [1] 72179

So we have two data.frames with 5923 and 72179 unique rows. Looking at the keys:

range(people$people);
## [1]    1 5923
setdiff(1:5923,people$people);
## integer(0)
range(peoplePerRepo$people_id);
## [1]    1 5923
setdiff(1:5923,peoplePerRepo$people_id);
## integer(0)

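The above proves that the key column in each data.frame contains only values in 1:5923, and that every value in that range is represented at least once in both tables. Because people$people has exactly 5923 elements, we know that each value in 1:5923 must be represented exactly once in it.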
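
The same key checks can be bundled into a few assertions. The following is a minimal sketch on made-up toy tables (the data and the helper variable names oneSideUnique, manyInOne and oneInMany are illustrative assumptions, not part of the original data); it only uses base R functions.

```r
# Toy stand-ins for the two tables (hypothetical data, not the real CSVs).
people <- data.frame(people = 1:4, committers = c(921, 240, 3, 6548))
peoplePerRepo <- data.frame(
  people_id     = c(1, 1, 1, 2, 3, 3, 4),
  repository_id = c(1, 2, 6, 1, 2, 7, 10)
)

# The key on the "one" side must not contain duplicates.
oneSideUnique <- !anyDuplicated(people$people)

# Every key on the "many" side must exist on the "one" side, and vice versa.
manyInOne <- all(peoplePerRepo$people_id %in% people$people)
oneInMany <- all(people$people %in% peoplePerRepo$people_id)

# Stop with an error if any assumption about the keys is violated.
stopifnot(oneSideUnique, manyInOne, oneInMany)
```

Running such checks before a join makes it easier to predict how many rows the result should have.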

range(table(people$people));
## [1] 1 1
range(table(peoplePerRepo$people_id));
## [1]   1 466

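The first of the two statements above reconfirms what I just said, namely that in people$people each value in 1:5923 is represented exactly once. The second statement shows that in peoplePerRepo$people_id the frequencies of the values 1:5923 range from 1 to 466. So this is definitely a one-to-many relationship. You can inspect the exact frequencies by omitting the range() call, in other words by running just table(peoplePerRepo$people_id), but the output is verbose and I won't include it here.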
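
If the full table() output is too verbose, a condensed view may be enough. This is a small sketch on a hypothetical key vector (the values are invented for illustration) that shows the group sizes, their range, and the keys with the most rows.

```r
# Hypothetical key column from the "many" side of a one-to-many relationship.
people_id <- c(1, 1, 1, 2, 3, 3, 4)

# Number of rows per key.
freq <- table(people_id)

# Smallest and largest group size.
freqRange <- range(freq)

# The three keys with the most rows, largest first.
topKeys <- sort(freq, decreasing = TRUE)[1:3]

# Distribution of the group sizes as plain numbers.
summary(as.vector(freq))
```

Here freqRange is 1 and 3: a minimum of 1 together with a maximum above 1 is what marks the relationship as one-to-many.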

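Finally, it's always good to check for NAs in the key columns. We can already infer that people$people contains no NAs, since it comprises exactly the set 1:5923, but we should at least check peoplePerRepo$people_id: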

sum(is.na(people$people));
## [1] 0
sum(is.na(peoplePerRepo$people_id));
## [1] 0

So there are no NAs in the key columns.

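Lastly, take a look at the summary() function, which is (often) handy for getting quick summary statistics for a vector or for all columns of a data.frame.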

summary(people);
##      people       committers      repositoryCommitter    authors        repositoryAuthor
##  Min.   :   1   Min.   :    0.0   Min.   :  0.0       Min.   :    0.0   Min.   :  0.00
##  1st Qu.:1482   1st Qu.:    0.0   1st Qu.:  0.0       1st Qu.:    2.0   1st Qu.:  1.00
##  Median :2962   Median :    0.0   Median :  0.0       Median :    6.0   Median :  2.00
##  Mean   :2962   Mean   :  200.0   Mean   : 11.6       Mean   :  198.2   Mean   : 14.06
##  3rd Qu.:4442   3rd Qu.:   39.5   3rd Qu.:  3.0       3rd Qu.:   63.0   3rd Qu.:  8.00
##  Max.   :5923   Max.   :15959.0   Max.   :466.0       Max.   :15938.0   Max.   :465.00
summary(peoplePerRepo);
##    people_id      repository_id
##  Min.   :   1.0   Min.   :   1.0
##  1st Qu.: 151.0   1st Qu.: 114.0
##  Median : 459.0   Median : 224.0
##  Mean   : 938.2   Mean   : 513.8
##  3rd Qu.:1147.0   3rd Qu.:1045.0
##  Max.   :5923.0   Max.   :1418.0
##                   NA's   :3470
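
The summary reports 3470 NAs in repository_id, and 72179 - 3470 = 68709, which is exactly the length quoted in the question. That suggests the question's count was taken after the complete.cases() filter in its code had dropped those rows. If rows without a repository id are not wanted in the result, they can be removed before joining. A minimal sketch on a made-up table (the data and the names nMissing and patched are assumptions for illustration):

```r
# Hypothetical toy table with one missing repository id.
peoplePerRepo <- data.frame(
  people_id     = c(1, 1, 2, 3),
  repository_id = c(1, 2, NA, 7)
)

# Count the missing values in the non-key column.
nMissing <- sum(is.na(peoplePerRepo$repository_id))

# Keep only the rows where repository_id is present.
patched <- peoplePerRepo[!is.na(peoplePerRepo$repository_id), ]

# The row count shrinks by exactly the number of missing values.
nrow(peoplePerRepo) - nMissing == nrow(patched)
```

Whether to keep or drop these rows depends on whether a person without a known repository should still appear in the output.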

So, based on all of the above, we can satisfy your requirement with a single merge() call:

output <- merge(people,peoplePerRepo,by.x='people',by.y='people_id');
nrow(output);
## [1] 72179
head(output);
##   people committers repositoryCommitter authors repositoryAuthor repository_id
## 1      1        921                 183     896              178             1
## 2      1        921                 183     896              178             2
## 3      1        921                 183     896              178             6
## 4      1        921                 183     896              178             7
## 5      1        921                 183     896              178            10
## 6      1        921                 183     896              178            11

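The output row count of 72179 makes sense. Since neither key column contains any NAs, and the key in every row of peoplePerRepo matches exactly one key in people, all 72179 rows join successfully with one row of people. You can see that the actual data in this sample matches your expected output.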
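
The row-count argument can be checked on a small scale. The sketch below uses invented toy tables (not the real CSVs) and the same by.x/by.y arguments to show that a one-to-many join yields one output row per row of the "many" table.

```r
# Hypothetical "one" side: one row per person.
people <- data.frame(
  people     = 1:3,
  committers = c(921, 240, 3)
)

# Hypothetical "many" side: several repositories per person.
peoplePerRepo <- data.frame(
  people_id     = c(1, 1, 1, 2, 3, 3),
  repository_id = c(1, 2, 6, 1, 2, 7)
)

# Inner join on differently named key columns; the key keeps the name from x.
output <- merge(people, peoplePerRepo, by.x = 'people', by.y = 'people_id')

# One output row per row of the "many" table.
nrow(output) == nrow(peoplePerRepo)
```

If some people had no repositories at all, the default inner join would drop them; passing all.x = TRUE to merge() would keep them with NA in repository_id.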

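One last (minor) point: the output column names don't exactly match your expected output column names. This can be solved either by assigning the entire column-name vector from scratch or by selectively replacing the column names you want to change. Here I'll demonstrate the latter approach: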

names(output)[names(output)=='people'] <- 'people_id';
names(output)[names(output)=='repositoryCommitter'] <- 'repoCommitter';
names(output)[names(output)=='repositoryAuthor'] <- 'repoAuthors';
names(output)[names(output)=='repository_id'] <- 'commonRepo';
head(output);
##   people_id committers repoCommitter authors repoAuthors commonRepo
## 1         1        921           183     896         178          1
## 2         1        921           183     896         178          2
## 3         1        921           183     896         178          6
## 4         1        921           183     896         178          7
## 5         1        921           183     896         178         10
## 6         1        921           183     896         178         11
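
For comparison, the other approach mentioned above, assigning the whole column-name vector at once, might look like the following sketch. The one-row data.frame is a hypothetical stand-in for the merged result, so the example runs on its own.

```r
# Hypothetical one-row stand-in for the merged result.
output <- data.frame(
  people = 1, committers = 921, repositoryCommitter = 183,
  authors = 896, repositoryAuthor = 178, repository_id = 1
)

# Replace every column name at once; the order must match the columns.
names(output) <- c('people_id', 'committers', 'repoCommitter',
                   'authors', 'repoAuthors', 'commonRepo')
```

This is shorter, but it silently mislabels columns if their order ever changes, which is why selective replacement by name is the more defensive choice.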

Merging data frames of different lengths in R
