Compare values across 3 data frames and append missing values

Time: 2018-04-26 20:35:34

Tags: r dataframe

I have 3 data frames.

Data1 -
Name_description   Numbers 
ABC                23
DEF                34
GHI                45
XYZ                43
JVK                23
LMN                21

Data 2 only has a list of names

Data 2- 
    Names            
    ABC                
    DEF                
    GHI                
    XYZ                
    JVK                
    LMN    
    PQR
    KJL      

Data 3 again has names and numbers

Data 3
Name_desc           Numbers 
    ABC                56
    DEF                67
    GHI                89
    XYZ                60
    JVK                88
    LMN                65
    PQR                100
    KJL                85

I want to do the following -

Check whether all names from data 2 are present in data 1
If any names are missing then 
{
get those names
get the numbers for those missing names from data 3
append above two things (missing names & numbers) to data 1
}
else
{data1<-data1
}

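I was thinking of simply merging the files, but I also need to make sure that if no name from Data 2 is missing in Data 1, then Data 1 stays unchanged (the same thing described in the pseudocode above).

In the above case, my final output should be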

Data 1- 

Name_description   Numbers 
    ABC                23
    DEF                34
    GHI                45
    XYZ                43
    JVK                23
    LMN                21
    PQR                100
    KJL                85

Thanks

5 answers:

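Answer 0: (score: 1)

First, merge Data1 and Data2, then find the NA values in this new data.frame, match those names against Data3, and finally replace the NAs with the corresponding Data3 values.

> tmp <- merge(Data1, Data2, by.x="Name_description", by.y="Names", all=TRUE)
> # rows whose Numbers are still NA after the merge (names only present in Data2)
> miss <- is.na(tmp$Numbers)
> # look those names up in Data3 and fill in the matching numbers
> tmp$Numbers[miss] <- Data3$Numbers[match(tmp$Name_description[miss], Data3$Name_desc)]
> tmp
  Name_description Numbers
1              ABC      23
2              DEF      34
3              GHI      45
4              JVK      23
5              LMN      21
6              XYZ      43
7              KJL      85
8              PQR     100

(The row order can differ depending on whether the name columns are factors or character vectors.)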

Answer 1: (score: 1)

I find dplyr::coalesce very handy in situations like the one the OP describes. After joining the 3 data frames, the two Numbers columns (one of which contains NA) can be combined with coalesce as:

library(dplyr)

Data1 %>%
  full_join(Data2, by=c("Name_description" = "Names")) %>%
  inner_join(Data3, by=c("Name_description" = "Name_desc")) %>%
  mutate(Numbers = coalesce(Numbers.x, Numbers.y)) %>%
  select(Name_description, Numbers)

#   Name_description Numbers
# 1              ABC      23
# 2              DEF      34
# 3              GHI      45
# 4              XYZ      43
# 5              JVK      23
# 6              LMN      21
# 7              PQR     100
# 8              KJL      85

Data:

Data1 <- read.table(text = 
"Name_description   Numbers 
ABC                23
DEF                34
GHI                45
XYZ                43
JVK                23
LMN                21",
header = TRUE, stringsAsFactors = FALSE)

Data2 <- read.table(text = 
"Names            
ABC                
DEF                
GHI                
XYZ                
JVK                
LMN    
PQR
KJL",
header = TRUE, stringsAsFactors = FALSE)

Data3 <- read.table(text = 
"Name_desc           Numbers 
ABC                56
DEF                67
GHI                89
XYZ                60
JVK                88
LMN                65
PQR                100
KJL                85",
header = TRUE, stringsAsFactors = FALSE)

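Answer 2: (score: 0)

Using dplyr, it should be something like:

library(dplyr)

Data1 %>% 
    bind_rows(
        Data2 %>% 
        # keep only the names that Data1 does not have yet
        anti_join(Data1, by = c("Names" = "Name_description")) %>% 
        # fetch their numbers from Data3
        left_join(Data3, by = c("Names" = "Name_desc")) %>% 
        # align the column name with Data1 before binding
        rename(Name_description = Names)
    )  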
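
The same anti-join idea can also be written in base R, following the if/else logic from the question step by step. This is only a minimal sketch: it rebuilds the three data frames with data.frame() using the values from the question, and the helper names missing_names and extra are made up for illustration.

```r
# Recreate the question's data (same values as in the question)
Data1 <- data.frame(
  Name_description = c("ABC", "DEF", "GHI", "XYZ", "JVK", "LMN"),
  Numbers = c(23L, 34L, 45L, 43L, 23L, 21L),
  stringsAsFactors = FALSE
)
Data2 <- data.frame(
  Names = c("ABC", "DEF", "GHI", "XYZ", "JVK", "LMN", "PQR", "KJL"),
  stringsAsFactors = FALSE
)
Data3 <- data.frame(
  Name_desc = c("ABC", "DEF", "GHI", "XYZ", "JVK", "LMN", "PQR", "KJL"),
  Numbers = c(56L, 67L, 89L, 60L, 88L, 65L, 100L, 85L),
  stringsAsFactors = FALSE
)

# Names listed in Data2 that Data1 does not contain yet
missing_names <- setdiff(Data2$Names, Data1$Name_description)

if (length(missing_names) > 0) {
  # Look up the numbers for the missing names in Data3
  extra <- data.frame(
    Name_description = missing_names,
    Numbers = Data3$Numbers[match(missing_names, Data3$Name_desc)],
    stringsAsFactors = FALSE
  )
  # Append the missing names and their numbers to Data1
  Data1 <- rbind(Data1, extra)
}
# Otherwise nothing is missing and Data1 is left unchanged

print(Data1)
```

Because the append only happens inside the if branch, Data1 is returned untouched when Data2 contains no new names, which is the condition the question asks for.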

Answer 3: (score: 0)

We can achieve this with left_join from dplyr together with ifelse.

library(dplyr)

Data4 <- Data2 %>%
  left_join(Data1, by = c("Names" = "Name_description")) %>%
  left_join(Data3, by = c("Names" = "Name_desc")) %>%
  mutate(Numbers = ifelse(is.na(Numbers.x), Numbers.y, Numbers.x)) %>%
  select(Names, Numbers)
Data4
#    Names Numbers
# 1   ABC      23
# 2   DEF      34
# 3   GHI      45
# 4   XYZ      43
# 5   JVK      23
# 6   LMN      21
# 7   PQR     100
# 8   KJL      85

Data

Data1 <- read.table(text = "Name_description   Numbers 
ABC                23
DEF                34
GHI                45
XYZ                43
JVK                23
LMN                21",
                    header = TRUE, stringsAsFactors = FALSE)

Data2 <- read.table(text = "Names            
    ABC                
    DEF                
    GHI                
    XYZ                
    JVK                
    LMN    
    PQR
    KJL",
                    header = TRUE, stringsAsFactors = FALSE)

Data3 <- read.table(text = "Name_desc           Numbers 
    ABC                56
    DEF                67
    GHI                89
    XYZ                60
    JVK                88
    LMN                65
    PQR                100
    KJL                85",
                    header = TRUE, stringsAsFactors = FALSE)

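Answer 4: (score: 0)

We don't actually need a merge at all. What you want is the first available value of Numbers, looking in Data1 first and then in Data3; when a name is in Data2 but in neither of the others, I want to return NA.

The fastest way to do this is with data.table, but I'll provide other options as well.

data.table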

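data.table::rbindlist can bind columns by position rather than by name (use.names=FALSE), which is very handy in this case. Data2 is padded with an NA column so that all three tables have two columns.

library(data.table)
rbindlist(list(Data1, Data3, cbind(Data2, NA)), use.names = FALSE)[, .SD[1, ], by = "Name_description"]

#    Name_description Numbers
# 1:              ABC      23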
# 2:              DEF      34
# 3:              GHI      45
# 4:              XYZ      43
# 5:              JVK      23
# 6:              LMN      21
# 7:              PQR     100
# 8:              KJL      85

tidyverse solution

The .keep_all argument of dplyr::distinct is very useful for avoiding the less readable %>% filter(!duplicated(Names)) or %>% group_by(Names) %>% slice(1).

library(tidyverse)
lst(Data1,Data3,cbind(Data2,NA)) %>%
  map(setNames,c("Names","Numbers")) %>%
  bind_rows %>%
  distinct(Names,.keep_all = TRUE) 

# Names Numbers
# 1   ABC      23
# 2   DEF      34
# 3   GHI      45
# 4   XYZ      43
# 5   JVK      23
# 6   LMN      21
# 7   PQR     100
# 8   KJL      85

Base R solution

x <- do.call(rbind,lapply(list(Data1,Data3,cbind(Data2,NA)),setNames,c("Names","Numbers")))
x[!duplicated(x[[1]]),]  
#    Names Numbers
# 1    ABC      23
# 2    DEF      34
# 3    GHI      45
# 4    XYZ      43
# 5    JVK      23
# 6    LMN      21
# 13   PQR     100
# 14   KJL      85
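
The order in which the tables are stacked matters in all three variants, because only the first occurrence of each name is kept. The sketch below illustrates this with a small, hypothetical pair of tables (first and second are made-up names, not the question's data).

```r
# Hypothetical toy data: "ABC" appears in both sources, "PQR" only in the second
first <- data.frame(Names = "ABC", Numbers = 23L, stringsAsFactors = FALSE)
second <- data.frame(Names = c("ABC", "PQR"), Numbers = c(56L, 100L),
                     stringsAsFactors = FALSE)

# Stack with 'first' on top: duplicated() is FALSE for the first occurrence
# of each name and TRUE for later ones, so the value from 'first' wins
stacked <- rbind(first, second)
keep_first <- stacked[!duplicated(stacked$Names), ]

# Reversing the stacking order makes the value from 'second' win for "ABC"
stacked_rev <- rbind(second, first)
keep_second <- stacked_rev[!duplicated(stacked_rev$Names), ]

print(keep_first)
print(keep_second)
```

This is why Data1 has to come before Data3 in the list: its numbers take priority, and Data3 only fills in names that Data1 lacks.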