I want to add the tech_distance column from the tech_distance table to my first_occurrences table. The two data tables:
head(first_occurrences)
# A tibble: 6 x 4
# Groups: Main, Second [6]
year Main Second occurrence
<int> <chr> <chr> <int>
1 1991 C09D C08F 1
2 2002 A47C A47D 1
3 2002 G10K H05K 1
4 2004 G06G C07K 1
5 2015 B64F B64D 1
6 2015 H02G B29C 1
head(tech_distance)
# A tibble: 6 x 2
Main tech_distance
<fctr> <dbl>
1 C09D 0.3
2 A47C 0.0
3 G10K 0.5
4 G06G 0.5
5 B64F 0.0
6 H02G 0.5
This is the result I want:
head(first_occurrences)
Main year Second occurrence tech_distance
1 A01B 2004 E21B 1 0.7
2 A01B 2004 E21B 1 0.5
3 A01B 2004 E21B 1 0.7
4 A01B 2004 E21B 1 0.5
5 A01B 2004 E21B 1 0.5
6 A01B 2004 E21B 1 1.0
I tried mutate from dplyr:
first_occurrences <- data %>%
select(year = X3,Main = X7,Second = X8) %>%
group_by(Main,Second) %>%
mutate(occurrence = n(), tech_distance) %>%
filter(occurrence >= 0, occurrence <= 1, !(Main == Second))
But I got this error:
Error in mutate_impl(.data, dots) :
Column `tech_distance` must be length 24 (the group size) or one, not 2
So I tried merge():
first_occurrences <- merge(first_occurrences, tech_distance, by.x = "Main", by.y = "Main", all.x=T)
This seems to work, but I end up with far more rows (240,217 entries):
str(first_occurrences)
'data.frame': 240217 obs. of 5 variables:
$ Main : chr "A01B" "A01B" "A01B" "A01B" ...
$ year : int 2004 2004 2004 2004 2004 2004 2004 2004 2004 2004 ...
$ Second : chr "E21B" "E21B" "E21B" "E21B" ...
$ occurrence : int 1 1 1 1 1 1 1 1 1 1 ...
$ tech_distance: num 0.7 0.5 0.7 0.5 0.5 1 0.5 0.7 0.3 0 ...
Before the merge, the datasets were:
str(first_occurrences)
Classes ‘grouped_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 8015 obs. of 4 variables:
$ year : int 1991 2002 2002 2004 2015 2015 2015 2015 2015 2015 ...
$ Main : chr "C09D" "A47C" "G10K" "G06G" ...
$ Second : chr "C08F" "A47D" "H05K" "C07K" ...
$ occurrence: int 1 1 1 1 1 1 1 1 1 1 ...
str(tech_distance)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 8015 obs. of 2 variables:
$ Main : Factor w/ 815 levels "A01B","A01C",..: 345 62 684 651 265 749 328 735 173 788 ...
$ tech_distance: num 0.3 0 0.5 0.5 0 0.5 0.5 0 0.5 0.5 ...
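Does anyone know how to merge the two data frames while keeping the same number of rows?
Answer 0 (score: 1)
Based on the comments above:
If tech_distance varies by more than one thing, e.g. Main and Second, I would create a new key column in both tables and then use it for a left_join.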
# assumes tech_distance also carries Second and year columns
first_occurrences <- mutate(first_occurrences, ID = paste0(Main, "_", Second, "_", year))
tech_distance <- mutate(tech_distance, ID = paste0(Main, "_", Second, "_", year))
combined_data <- dplyr::left_join(first_occurrences, tech_distance, by = "ID")
To reorder the columns, just use select() with the column names in the order you want, followed by -ID to drop the key column.
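As an alternative to pasting a key together, left_join can match on several columns directly. The sketch below uses made-up toy data and assumes, hypothetically, that tech_distance also has Second and year columns (the table shown in the question does not):

```r
library(dplyr)

# Toy data (hypothetical values, not the asker's real tables)
first_occurrences <- tibble(
  year = c(1991L, 2002L),
  Main = c("C09D", "A47C"),
  Second = c("C08F", "A47D"),
  occurrence = 1L
)

# Assumption: the distance table is keyed by Main, Second and year
tech_distance <- tibble(
  year = c(1991L, 2002L),
  Main = c("C09D", "A47C"),
  Second = c("C08F", "A47D"),
  tech_distance = c(0.3, 0.0)
)

# Join on all three columns at once; no helper ID column is needed
combined_data <- left_join(first_occurrences, tech_distance,
                           by = c("Main", "Second", "year"))

# Reorder the columns to the layout asked for in the question
combined_data <- select(combined_data, Main, year, Second, occurrence, tech_distance)
```

Joining on the columns themselves also avoids the duplicated Main/Second/year columns that a join on a pasted ID would otherwise leave behind.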
For anyone else who may be reading this:
Assuming tech_distance is specific to each Main and nothing else, I would use:
combined_data <- dplyr::left_join(first_occurrences, tech_distance, by = "Main")
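A minimal sketch of that single-key join on toy data (the values are made up). It assumes tech_distance holds exactly one row per Main code, and it converts the factor Main to character first, since the str() output in the question shows a factor on one side and a character column on the other:

```r
library(dplyr)

# Toy data (hypothetical values, not the asker's real tables)
first_occurrences <- tibble(
  year = c(1991L, 2002L, 2002L),
  Main = c("C09D", "A47C", "A47C"),
  Second = c("C08F", "A47D", "B01J"),
  occurrence = 1L
)

# Assumption: one row per Main code in the lookup table
tech_distance <- tibble(
  Main = factor(c("C09D", "A47C")),
  tech_distance = c(0.3, 0.0)
)

# Make the key types match, then join
combined_data <- first_occurrences %>%
  left_join(mutate(tech_distance, Main = as.character(Main)), by = "Main")

# With unique keys in tech_distance the row count is unchanged
same_rows <- nrow(combined_data) == nrow(first_occurrences)
```

If same_rows comes out FALSE on the real data, tech_distance has repeated Main values.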
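Answer 1 (score: 0)
Are the values in the Main column unique in both tables? If they are, you get a one-to-one match and your result will have 8015 rows. If there are duplicates, you get a one-to-many match and end up with more rows.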
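One way to check this, sketched on made-up toy data in which A47C is deliberately repeated in both tables:

```r
library(dplyr)

# Toy data (hypothetical values, not the asker's real tables)
first_occurrences <- tibble(
  year = c(1991L, 2002L, 2002L),
  Main = c("C09D", "A47C", "A47C"),
  Second = c("C08F", "A47D", "B01J"),
  occurrence = 1L
)
tech_distance <- tibble(
  Main = c("C09D", "A47C", "A47C"),
  tech_distance = c(0.3, 0.0, 0.5)
)

# Main codes that appear more than once in the lookup table
dup_keys <- tech_distance %>%
  count(Main) %>%
  filter(n > 1)

# Each A47C row on the left matches both A47C rows on the right,
# so the join returns 1 + 2 * 2 = 5 rows instead of 3
joined <- left_join(first_occurrences, tech_distance, by = "Main")
n_joined <- nrow(joined)
```

Recent dplyr versions also warn about a many-to-many relationship in a join like this. How the duplicates in dup_keys should be collapsed (first value, mean, something else) depends on what tech_distance is meant to measure.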