Finding matches and similarity between two files in R

Time: 2015-03-11 12:14:09

Tags: r dataframe compare similarity

I have two large files whose contents look like this:


Input (dput output):

DF1

structure(list(X00.00.location.long. = structure(c(1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 
5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 
6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 
6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 
6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 7L, 7L, 7L, 7L, 
7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 
7L, 7L, 7L, 7L), .Label = c("00:00,location,long|", "00:00,location,long|00:00,location,same|", 
"00:00,location,long|00:00,runapps,com.sol.sviewcall|00:00,screen,OFF|", 
"00:00,location,long|00:00,wifi,dlink, PATECH-AP|00:00,runapps,com.kakao.talk|00:00,screen,OFF|", 
"00:00,location,long|00:00,wifi,dlink, PATECH-AP|00:00,wifi,dlink, iptime|00:00,wifi,dlink|", 
"00:00,location,long|00:00,wifi,dlink|", "00:00,location,long|00:00,wifi,dlink|00:00,location,same|00:00,wifi,dlink, iptime|"
), class = "factor")), .Names = "X00.00.location.long.", class = "data.frame", row.names = c(NA, 
-183L))

DF2

structure(list(X00.00.location.long. = structure(c(1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 
5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 
5L, 5L, 5L, 5L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 
6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 7L, 7L, 7L, 7L, 7L, 
7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 
7L, 7L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 
8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 
9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 10L, 
10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 
10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L), .Label = c("00:00,location,long|", 
"00:00,location,long|00:00,bluetooth,SCH-W860(35**)|00:00,wifi,dlink, iptime|", 
"00:00,location,long|00:00,bluetooth,SCH-W860(35**)|00:00,wifi,dlink, SK_WiFi26C4, U+zone, U+Net642B|", 
"00:00,location,long|00:00,wifi,dlink, SK_WiFi26C4|", "01:00,location,long|", 
"01:00,location,long|01:00,bluetooth,SCH-W860(35**)|01:00,screen,OFF|01:00,runapps,com.kakao.talk|", 
"01:00,location,long|01:00,bluetooth,SCH-W860(35**)|01:00,wifi,dlink, iptime, SK_WiFi26C4|01:00,wifi,dlink, iptime, PISnet_4D9740|01:00,wifi,dlink, iptime, SK_WiFi26C4, KT_WLAN_BBE3|01:00,runapps,com.buzzpia.aqua.launcher|01:00,screen,OFF|", 
"01:00,location,long|01:00,screen,OFF|", "02:00,location,long|02:00,wifi,dlink, iptime, SK_WiFi26C4|02:00,wifi,dlink, iptime, SK_WiFi26C4, KT_WLAN_BBE3|02:00,wifi,dlink, iptime, KT_WLAN_BBE3|02:00,runapps,com.kakao.talk|02:00,screen,OFF|", 
"02:00,location,long|02:00,wifi,dlink, iptime|02:00,runapps,com.buzzpia.aqua.launcher|02:00,runapps,com.android.mms|02:00,screen,OFF|"
), class = "factor")), .Names = "X00.00.location.long.", class = "data.frame", row.names = c(NA, 
-232L))

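My questions are:

  1. I want to know the percentage of matching data over all rows, i.e. the number of rows that hold the same data in df1 and df2.

  2. I want to know the percentage of similar data over all rows. A single data item looks like "00:00,location,long", and I use the separator "|" to separate one item from the next. If a row in df1 and a row in df2 are >= 75% similar, I consider the rows similar. For example, a row contains three data items, two of them the same and one different: similar.

  3. I want to know the percentage of different data over all rows in df1 and df2.
  4. So I want to calculate the percentage of matching rows (how many rows in df1 match rows in df2), the percentage of similar rows (how many rows in df1 are similar to rows in df2), and the percentage of different rows (how many rows in df1 differ from the rows in df2).

    The base data is df1; I mean that I want to know how many rows of df2 match, are similar to, or differ from df1.

    I am using R. I have tried, but I am stuck. I hope someone can give me a hint.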

1 answer:

Answer 0 (score: 1)

I guess your problem is to find all rows of df2 that are not in df1, or all rows of df2 that are also in df1. If that is what you mean, you can use sqldf:

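library(sqldf)

# EXCEPT and INTERSECT are set operations, so each result holds distinct rows only.
df2NotIndf1 <- sqldf('SELECT * FROM df2 EXCEPT SELECT * FROM df1')  # rows of df2 not in df1
df2Indf1 <- sqldf('SELECT * FROM df2 INTERSECT SELECT * FROM df1')  # rows of df2 also in df1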

Alternatively, you can use dplyr:

library(dplyr)
anti_join(df2, df1)  # rows of df2 with no match in df1
semi_join(df2, df1)  # rows of df2 that have a match in df1
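The joins return rows rather than percentages. As a rough sketch of how the share of exactly matching rows could be computed in base R, the toy data frames and the column name x below are hypothetical stand-ins for the real data, which would use the X00.00.location.long. column:

```r
# Toy stand-ins for the real data (hypothetical values, one column each).
df1 <- data.frame(x = c("a|", "a|b|", "a|b|"), stringsAsFactors = FALSE)
df2 <- data.frame(x = c("a|", "a|c|", "a|b|", "d|"), stringsAsFactors = FALSE)

# Compare as character so that differing factor levels cannot interfere.
in_df1 <- as.character(df2$x) %in% as.character(df1$x)

pct_match     <- 100 * mean(in_df1)   # share of df2 rows that occur in df1
pct_different <- 100 * mean(!in_df1)  # share of df2 rows absent from df1

print(c(match = pct_match, different = pct_different))
```

This counts every row of df2, duplicates included, which is probably what a percentage over all rows calls for. Here "different" simply means "not an exact match"; rows that are merely similar are not yet separated out.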

For similarity, if you want to measure a similarity score between two strings, you can use the Levenshtein distance; see this link for details. You can apply it to the data frames.
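One possible way to apply this is sketched below, using adist() from the utils package, which computes the Levenshtein distance. The normalisation (one minus the distance divided by the length of the longer string), the rule of keeping the best score per df2 row, and the toy rows are all assumptions made for illustration; only the 75% threshold comes from the question.

```r
# Hypothetical toy rows in the same "time,type,value|" layout as the question.
df1 <- data.frame(x = c("00:00,location,long|",
                        "00:00,location,long|00:00,wifi,dlink|"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(x = c("00:00,location,long|",
                        "00:00,location,long|00:00,wifi,dlinc|",
                        "02:00,screen,OFF|02:00,runapps,com.android.mms|"),
                  stringsAsFactors = FALSE)

a <- unique(as.character(df1$x))  # base strings (df1)
b <- as.character(df2$x)          # strings to classify (df2)

# adist() returns the Levenshtein distance matrix:
# one row per element of b, one column per element of a.
d <- adist(b, a)

# Assumed normalisation to [0, 1]: 1 - distance / length of the longer string.
max_len <- outer(nchar(b), nchar(a), pmax)
sim <- 1 - d / max_len

# For each df2 row, keep its best similarity against any df1 row.
best <- apply(sim, 1, max)

status <- ifelse(best == 1, "match",
                 ifelse(best >= 0.75, "similar", "different"))

# Percentage of df2 rows in each class.
classes <- c("match", "similar", "different")
pct <- 100 * prop.table(table(factor(status, levels = classes)))
print(round(pct, 1))
```

The distance matrix has nrow(df2) times the number of unique df1 strings entries, so on large files it may be worth deduplicating df2 as well before calling adist(). A character-level score is also not quite the question's definition of similarity; splitting each row on "|" with strsplit() and comparing the resulting items would follow it more closely.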