Question

I have two large files with the following contents:

The dput() output of each:

DF1

structure(list(X00.00.location.long. = structure(c(1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 
5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 
6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 
6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 
6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 7L, 7L, 7L, 7L, 
7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 
7L, 7L, 7L, 7L), .Label = c("00:00,location,long|", "00:00,location,long|00:00,location,same|", 
"00:00,location,long|00:00,runapps,com.sol.sviewcall|00:00,screen,OFF|", 
"00:00,location,long|00:00,wifi,dlink, PATECH-AP|00:00,runapps,com.kakao.talk|00:00,screen,OFF|", 
"00:00,location,long|00:00,wifi,dlink, PATECH-AP|00:00,wifi,dlink, iptime|00:00,wifi,dlink|", 
"00:00,location,long|00:00,wifi,dlink|", "00:00,location,long|00:00,wifi,dlink|00:00,location,same|00:00,wifi,dlink, iptime|"
), class = "factor")), .Names = "X00.00.location.long.", class = "data.frame", row.names = c(NA, 
-183L))

DF2

structure(list(X00.00.location.long. = structure(c(1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 
5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 
5L, 5L, 5L, 5L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 
6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 7L, 7L, 7L, 7L, 7L, 
7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 
7L, 7L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 
8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 
9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 10L, 
10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 
10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L), .Label = c("00:00,location,long|", 
"00:00,location,long|00:00,bluetooth,SCH-W860(35**)|00:00,wifi,dlink, iptime|", 
"00:00,location,long|00:00,bluetooth,SCH-W860(35**)|00:00,wifi,dlink, SK_WiFi26C4, U+zone, U+Net642B|", 
"00:00,location,long|00:00,wifi,dlink, SK_WiFi26C4|", "01:00,location,long|", 
"01:00,location,long|01:00,bluetooth,SCH-W860(35**)|01:00,screen,OFF|01:00,runapps,com.kakao.talk|", 
"01:00,location,long|01:00,bluetooth,SCH-W860(35**)|01:00,wifi,dlink, iptime, SK_WiFi26C4|01:00,wifi,dlink, iptime, PISnet_4D9740|01:00,wifi,dlink, iptime, SK_WiFi26C4, KT_WLAN_BBE3|01:00,runapps,com.buzzpia.aqua.launcher|01:00,screen,OFF|", 
"01:00,location,long|01:00,screen,OFF|", "02:00,location,long|02:00,wifi,dlink, iptime, SK_WiFi26C4|02:00,wifi,dlink, iptime, SK_WiFi26C4, KT_WLAN_BBE3|02:00,wifi,dlink, iptime, KT_WLAN_BBE3|02:00,runapps,com.kakao.talk|02:00,screen,OFF|", 
"02:00,location,long|02:00,wifi,dlink, iptime|02:00,runapps,com.buzzpia.aqua.launcher|02:00,runapps,com.android.mms|02:00,screen,OFF|"
), class = "factor")), .Names = "X00.00.location.long.", class = "data.frame", row.names = c(NA, 
-232L))

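My questions are:

1. I want to know the percentage of matching rows, that is, how many rows contain exactly the same data in df1 and df2.
2. I want to know the percentage of similar rows. One data item looks like "00:00,location,long", and I use the separator "|" to separate the items within a row. If a row in df1 and a row in df2 are >= 75% similar, I consider the rows similar. For example, a row that contains three items, two of them the same and one different, counts as similar.
3. I want to know the percentage of different rows between df1 and df2.

So I want to calculate the percentage of matching rows (how many rows in df1 match rows in df2), the percentage of similar rows (how many rows in df1 are similar to rows in df2), and the percentage of different rows (how many rows in df1 differ from the rows in df2).

df1 is the base data: I want to know how many rows of df2 match, are similar to, or differ from df1.

I am using R. I have tried, but I am stuck. I hope someone can point me in the right direction.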

Answer 1

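I am guessing that your problem is to find all rows of df2 that are not in df1, or all rows of df2 that are in df1. If that is what you mean, you can use the sqldf library: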

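library(sqldf)

# Rows of df2 that do not appear in df1 (EXCEPT returns distinct rows)
df2NotIndf1 <- sqldf('SELECT * FROM df2 EXCEPT SELECT * FROM df1')
# Rows of df2 that also appear in df1 (INTERSECT returns distinct rows)
df2Indf1 <- sqldf('SELECT * FROM df2 INTERSECT SELECT * FROM df1')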

Alternatively, you can use dplyr:

library(dplyr)
# Rows of df2 that have no match in df1
anti_join(df2, df1)
# Rows of df2 that have at least one match in df1
semi_join(df2, df1)
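To turn these row sets into the percentage the question asks for, one option is to test membership directly in base R. The following is a minimal sketch on two small made-up data frames: toy1 and toy2 are hypothetical stand-ins for df1 and df2, and the column name x is an assumption. It treats two rows as matching only when the whole string is identical.

```r
# Toy stand-ins for df1 and df2 (hypothetical data, one column each).
toy1 <- data.frame(x = c("00:00,location,long|",
                         "00:00,location,long|00:00,wifi,dlink|"),
                   stringsAsFactors = FALSE)
toy2 <- data.frame(x = c("00:00,location,long|",
                         "00:00,location,long|",
                         "01:00,location,long|",
                         "00:00,location,long|00:00,wifi,dlink|"),
                   stringsAsFactors = FALSE)

# Compare as character so that differing factor levels do not matter.
in_toy1 <- as.character(toy2$x) %in% as.character(toy1$x)

# Share of toy2 rows that have an identical row in toy1.
pct_match <- 100 * mean(in_toy1)
pct_match  # 75
```

Unlike the EXCEPT and INTERSECT queries, this counts every row of toy2, duplicates included, which is probably what a percentage over all rows needs.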

For similarity, if you want to measure a similarity score between two strings, you can use the Levenshtein distance (see this link for details). You can apply it to a data frame.
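As a rough illustration of the similarity step, the sketch below classifies each row as a match, similar, or different. It rests on several assumptions that are not in the question or the answer: toy1 and toy2 are made-up stand-ins for df1 and df2, the helper names split_items, jaccard and classify_row are hypothetical, and "75% similar" is interpreted as a Jaccard index of at least 0.75 over the "|"-separated items. The last lines show the character-level Levenshtein distance with base R's adist() for comparison.

```r
# Toy stand-ins for the single column of df1 and df2 (hypothetical data).
toy1 <- c("00:00,location,long|00:00,wifi,dlink|00:00,screen,OFF|00:00,runapps,com.kakao.talk|",
          "00:00,location,long|")
toy2 <- c("00:00,location,long|",
          "00:00,location,long|00:00,wifi,dlink|00:00,screen,OFF|",
          "02:00,location,long|02:00,screen,OFF|")

# Split one row into its distinct "|"-separated items, dropping empty pieces.
split_items <- function(s) {
  items <- strsplit(as.character(s), "|", fixed = TRUE)[[1]]
  unique(items[nzchar(items)])
}

# Jaccard index: shared items divided by all distinct items (an assumed metric).
jaccard <- function(a, b) {
  u <- union(a, b)
  if (length(u) == 0) return(0)
  length(intersect(a, b)) / length(u)
}

# Label one row against the base rows: exact match first, then best similarity.
classify_row <- function(row, base_rows, base_items, threshold = 0.75) {
  if (row %in% base_rows) return("match")
  items <- split_items(row)
  best <- max(vapply(base_items, jaccard, numeric(1), b = items))
  if (best >= threshold) "similar" else "different"
}

base_items <- lapply(unique(toy1), split_items)
labels <- vapply(toy2, classify_row, character(1),
                 base_rows = toy1, base_items = base_items, USE.NAMES = FALSE)

# Percentage of toy2 rows in each class.
pct <- 100 * prop.table(table(factor(labels,
                                     levels = c("match", "similar", "different"))))
round(pct, 1)

# Character-level alternative: Levenshtein distance via utils::adist(),
# scaled to a 0..1 similarity by the longer string's length.
lev <- utils::adist(toy2[2], toy1[1])[1, 1]
lev_sim <- 1 - lev / max(nchar(toy2[2]), nchar(toy1[1]))
```

The threshold is an argument of classify_row, so a different cut-off or a different similarity measure can be swapped in without changing the rest. On real data, comparing every row of df2 against every distinct row of df1 may be slow, so deduplicating both sides first is worth considering.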

Finding matches and similarity between two files in R
