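I have 106 columns in the first data frame and 97 columns in the second, and I want to merge them. To do that, I need the same columns in both data frames.
How can I get the results listed below?
DF1: column names are A, B, C and D
DF2: column names are A, B and E
Is it possible to select the following combinations of columns from the data frames?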
1) Columns that match in both, i.e. A & B
2) Extras in the second, i.e. E
3) Extras in the first, i.e. C & D
I tried select() in dplyr, colnames(df1) == colnames(df2), and various other approaches, but none of them worked.
Here are the column names of data frame 1:
[1] "ï..Lan.ID" "NBFC" "Application.ID"
[4] "Region" "Loan.City" "Loan.Type"
[7] "Loan.Scheme" "Name" "Mobile.Number"
[10] "Loan.Status" "Principal.Outstanding" "Last.EMI"
[13] "Next.EMI" "Next.Bullet.Month" "Next.Bullet.Amount"
[16] "Sum.Instalment.Posted" "Dues.Receipts" "EMI.Due"
[19] "All.Dues" "Instalment.Dues" "Bullets.Overdue"
[22] "Loan.Quality" "Sanctioned.Amount" "Loan.Amount"
[25] "Tenure" "Completed.Tenure" "Tenure.Left"
[28] "Personal.Email" "Official.Email" "No..Of.Late.Payments"
[31] "CRIF.Score" "CIBIL.Score" "No.of.Actions"
[34] "Fixed.Income" "ECS.Customer.Name" "ECS.Bank.Name"
[37] "ECS.Account.Number" "Loan.Date" "Sanction.Month"
[40] "EMI.Start.Date" "X1st.EMI.Month" "End.Date"
[43] "Home.Address" "Permanent.Address" "Employer.Name"
[46] "Company.MCA.ID" "Business.Address" "Reference.Details"
[49] "Nature.of.Business" "Pan.Card" "Aadhar.UID"
[52] "Gender" "Educational.Qualification" "DOB"
[55] "Marital.Status" "Last.Payment.Date" "Job.Type"
[58] "Employment.Year" "Cycle.Date" "Age"
[61] "relevant_pos" "crif_active_accounts" "crif_overdue_amt"
[64] "crif_current_outstanding" "cibil_active_accounts" "cibil_overdue_amt"
[67] "cibil_current_outstanding" "NACH.Status" "Awarenss.Allocation"
[70] "Allocation.Date" "Awareness.Data" "Awareness.Brk.up"
[73] "Dec.19.EMI.Amount" "Tenure.End" "Dec.19.BKt"
[76] "DPD" "New.DPD" "DPD.Range.New"
[79] "New.Amount.Due" "New.Total.Due" "Loan.Slabs"
[82] "Last.Month.Bnc" "X1st.EMI" "Dec.19.Bnc"
[85] "Dec.19.Non.Starter" "Reason.of.Bnc" "HNI"
[88] "EMI.Due.1" "OS" "Advance.Paid"
[91] "Paid.Unpaid" "Not.Allocated" "Excess"
[94] "CC.Take.Over...OD" "Last.Month.delinq" "Loan.Status.1"
[97] "CIBIL.Bracket" "Salary.Bracket" "DPD.1"
[100] "Reason.of.Default" "Contactibility" "Delinq"
[103] "PayTm.Industry" "Industry" "Employer.Name.1"
[106] "DELINQ.NON.DELINQ"
Data frame 2:
[1] "ï..Lan.ID" "NBFC" "Application.ID"
[4] "Region" "Loan.City" "Loan.Type"
[7] "Loan.Scheme" "Name" "Mobile.Number"
[10] "Loan.Status" "Principal.Outstanding" "Last.EMI"
[13] "Next.EMI" "Next.Bullet.Month" "Next.Bullet.Amount"
[16] "Sum.Instalment.Posted" "Dues.Receipts" "EMI.Due"
[19] "All.Dues" "Instalment.Dues" "Bullets.Overdue"
[22] "Loan.Quality" "Sanctioned.Amount" "Loan.Amount"
[25] "Tenure" "Completed.Tenure" "Tenure.Left"
[28] "Personal.Email" "Official.Email" "No..Of.Late.Payments"
[31] "CRIF.Score" "CIBIL.Score" "No.of.Actions"
[34] "Fixed.Income" "ECS.Customer.Name" "ECS.Bank.Name"
[37] "ECS.Account.Number" "Loan.Date" "Sanction.Month"
[40] "EMI.Start.Date" "X1st.EMI.Month" "End.Date"
[43] "Home.Details" "Permanent.Address.Details" "Employer.Name"
[46] "Company.MCA.ID" "Business.Details" "Reference.Details"
[49] "Nature.of.Business" "Pan.Card" "Aadhar.UID"
[52] "Gender" "Educational.Qualification" "DOB"
[55] "Marital.Status" "Last.Payment.Date" "Job.Type"
[58] "Employment.Year" "Cycle.Date" "Age"
[61] "relevant_pos" "crif_active_accounts" "crif_overdue_amt"
[64] "crif_current_outstanding" "cibil_active_accounts" "cibil_overdue_amt"
[67] "cibil_current_outstanding" "NACH.status" "Awarenss.Allocation"
[70] "Allocation.Date" "Awareness.Data" "Awareness.Brk.up"
[73] "June.19.EMI.Amount" "Tenure.End" "June.BKt"
[76] "Loan.Slabs" "Last.Month.Bnc" "X1st.EMI"
[79] "June.19.Bnc" "June.19.Non.Starter" "Reason.of.Bnc"
[82] "HNI" "EMI.Due.1" "OS"
[85] "Advance.Paid" "PAID.Unpaid" "Not.Allocated"
[88] "Excess" "DPD" "CC.Take.Over"
[91] "Last.Month.delinq" "Loan.Status.1" "CIBIL.Bracket"
[94] "Salary.Bracket" "DPD.1" "DELINQ.NON.DELINQ"
[97] "Month"
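The expected result here is the names of the columns that match in both data frames and the names of the columns that do not.
Answer 0 (score: 1)
I think Sotos's comment gives the most elegant output for your question.
However, you can also use %in%: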
O1 = colnames(dfA)[colnames(dfA) %in% colnames(dfB)]
> O1
[1] "A" "B" "C"
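The same idea extends to all three groups with base R's set functions. This is a minimal sketch, assuming toy data frames with the same column names as the example data in this answer; the values and the names `common`, `only_A` and `only_B` are made up for illustration.

```r
# Toy data frames with the same column names as the example data
# (the values are arbitrary and only the names matter here).
dfA <- data.frame(A = 1:2, B = 3:4, C = 5:6, D = 7:8)
dfB <- data.frame(A = 1:2, B = 3:4, C = 5:6, E = 7:8)

# Columns present in both data frames
common <- intersect(colnames(dfA), colnames(dfB))

# Columns present only in the first data frame
only_A <- setdiff(colnames(dfA), colnames(dfB))

# Columns present only in the second data frame
only_B <- setdiff(colnames(dfB), colnames(dfA))

print(common)  # "A" "B" "C"
print(only_A)  # "D"
print(only_B)  # "E"
```

Note that setdiff() is not symmetric, so the order of the arguments decides which data frame's extras are returned.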
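However, your matching conditions 2) and 3) are a bit confusing, because when you ask for:
2) Same in both and in the second, i.e. A, B & E
I think that corresponds to all the columns of the second dataset (colnames(dfB)).
3) Present in the first and same in both, i.e. A, B, C & D
That corresponds to all the columns of the first dataset (colnames(dfA)).
Does that make sense to you? Did I miss something in your merge pattern?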
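If the goal of the merge is to stack the two data frames by rows, one option is to keep only the shared columns before binding. This is a sketch under that assumption, not necessarily the merge the question intends; the data values and the name `stacked` are hypothetical.

```r
# Toy data frames; only the column names mirror the example data.
dfA <- data.frame(A = 1:2, B = 3:4, C = 5:6, D = 7:8)
dfB <- data.frame(A = 9:10, B = 11:12, C = 13:14, E = 15:16)

# Names shared by both data frames
common <- intersect(colnames(dfA), colnames(dfB))

# Subset each data frame to the shared columns, then stack the rows
stacked <- rbind(dfA[common], dfB[common])

print(dim(stacked))       # 4 rows, 3 columns
print(colnames(stacked))  # "A" "B" "C"
```

If the unmatched columns should be kept instead, dplyr::bind_rows() fills the columns missing from one input with NA.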
Data
dfA = data.frame(matrix(sample(1:100, 16), ncol = 4, nrow = 4))
colnames(dfA) = LETTERS[1:4]
dfB = data.frame(matrix(sample(1:100, 16), ncol = 4, nrow = 4))
colnames(dfB) = LETTERS[c(1:3,5)]
> dfA
A B C D
1 75 66 17 89
2 46 7 27 38
3 97 26 47 31
4 32 20 71 2
> dfB
A B C E
1 94 70 18 16
2 69 57 29 60
3 53 50 25 96
4 37 51 64 75
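One caveat for the real column lists in the question: %in% and intersect() compare names exactly, so pairs such as "NACH.Status" / "NACH.status" and "Paid.Unpaid" / "PAID.Unpaid" count as unmatched. A possible way to flag such near matches is to compare lower-cased names. The sketch below uses a small subset of the names from the question; the vectors `cols1` and `cols2` and the result names are made up for illustration.

```r
# A few column names taken from the two lists in the question
cols1 <- c("Loan.Slabs", "NACH.Status", "Paid.Unpaid", "Home.Address")
cols2 <- c("Loan.Slabs", "NACH.status", "PAID.Unpaid", "Month")

# Exact matches only
exact <- intersect(cols1, cols2)

# Names in cols1 that match a name in cols2 only when case is ignored
case_only <- cols1[tolower(cols1) %in% tolower(cols2) & !(cols1 %in% cols2)]

print(exact)      # "Loan.Slabs"
print(case_only)  # "NACH.Status" "Paid.Unpaid"
```

Whether such columns really hold the same variable is a judgment call, so it is worth inspecting them before renaming anything.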