digest - modifying only one

Time: 2016-10-13 16:17:06

Tags: r encryption mask digest anonymize

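I am trying to create a Shiny app that lets the user pick columns to encrypt, where the value in each row should always be the same on subsequent runs as long as the data is the same. That is, if the customer name is "John", you always get "A" when running this process; if the customer name changes to "Jon", you might get "C"; but if it changes back to "John", you get "A" again. This will be used to "mask" sensitive data for analysis.

Also, if anyone knows a way to "decrypt" these columns later by storing a key for future use, that would be much appreciated.

A simplified version of what I am trying to do (requires the digest library):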

library(digest)

test <- data.frame(CustomerName=c("John Snow","John Snow","Daffy Duck","Daffy Duck","Daffy Duck","Daffy Duck","Daffy Duck","Joe Farmer","Joe Farmer","Joe Farmer","Joe Farmer"),
               LoanNumber=c("12548","45878","45796","45813","45125","45216","45125","45778","45126","32548","45683"),
               LoanBalance=c("458463","5412548","458463","5412548","458463","5412548","458463","5412548","458463","5412548","2484722"),
               FarmType=c("Hay","Dairy","Fish","Hay","Dairy","Fish","Hay","Dairy","Fish","Hay","Dairy"))


test[,1] <- sapply(test[,1],digest,algo="sha1")

Sample output:

                                   CustomerName LoanNumber LoanBalance FarmType
1  5c96f777a14f201a6a9b79623d548f7ab61c7a11      12548      458463      Hay
2  5c96f777a14f201a6a9b79623d548f7ab61c7a11      45878     5412548    Dairy
3  10bf345ab114c20df2d1eedbbe7e7cd6b969db05      45796      458463     Fish
4  10bf345ab114c20df2d1eedbbe7e7cd6b969db05      45813     5412548      Hay
5  10bf345ab114c20df2d1eedbbe7e7cd6b969db05      45125      458463    Dairy
6  10bf345ab114c20df2d1eedbbe7e7cd6b969db05      45216     5412548     Fish
7  10bf345ab114c20df2d1eedbbe7e7cd6b969db05      45125      458463      Hay
8  b0db86a39b9617cef61a8986fd57af7960eec9f4      45778     5412548    Dairy
9  b0db86a39b9617cef61a8986fd57af7960eec9f4      45126      458463     Fish
10 b0db86a39b9617cef61a8986fd57af7960eec9f4      32548     5412548      Hay
11 b0db86a39b9617cef61a8986fd57af7960eec9f4      45683     2484722    Dairy

Modified data frame (the 'h' removed from "John"):

test <- data.frame(CustomerName=c("Jon Snow","Jon Snow","Daffy Duck","Daffy Duck","Daffy Duck","Daffy Duck","Daffy Duck","Joe Farmer","Joe Farmer","Joe Farmer","Joe Farmer"),
               LoanNumber=c("12548","45878","45796","45813","45125","45216","45125","45778","45126","32548","45683"),
               LoanBalance=c("458463","5412548","458463","5412548","458463","5412548","458463","5412548","458463","5412548","2484722"),
               FarmType=c("Hay","Dairy","Fish","Hay","Dairy","Fish","Hay","Dairy","Fish","Hay","Dairy"))

test[,1] <- sapply(test[,1],digest,algo="sha1")

New output:

                                   CustomerName LoanNumber LoanBalance FarmType
1  2cabeabb3b50e04d3b46ea2c68ab12c7350cd87f      12548      458463      Hay
2  2cabeabb3b50e04d3b46ea2c68ab12c7350cd87f      45878     5412548    Dairy
3  b0187b6ff2322fa86004d4d22cd479f3cdc345d2      45796      458463     Fish
4  b0187b6ff2322fa86004d4d22cd479f3cdc345d2      45813     5412548      Hay
5  b0187b6ff2322fa86004d4d22cd479f3cdc345d2      45125      458463    Dairy
6  b0187b6ff2322fa86004d4d22cd479f3cdc345d2      45216     5412548     Fish
7  b0187b6ff2322fa86004d4d22cd479f3cdc345d2      45125      458463      Hay
8  2127453066c45db6ba7e2f6f8c14d22796c3fd54      45778     5412548    Dairy
9  2127453066c45db6ba7e2f6f8c14d22796c3fd54      45126      458463     Fish
10 2127453066c45db6ba7e2f6f8c14d22796c3fd54      32548     5412548      Hay
11 2127453066c45db6ba7e2f6f8c14d22796c3fd54      45683     2484722    Dairy

What I expected:

                                   CustomerName LoanNumber LoanBalance FarmType
1  2cabeabb3b50e04d3b46ea2c68ab12c7350cd87f      12548      458463      Hay
2  2cabeabb3b50e04d3b46ea2c68ab12c7350cd87f      45878     5412548    Dairy
3  10bf345ab114c20df2d1eedbbe7e7cd6b969db05      45796      458463     Fish
4  10bf345ab114c20df2d1eedbbe7e7cd6b969db05      45813     5412548      Hay
5  10bf345ab114c20df2d1eedbbe7e7cd6b969db05      45125      458463    Dairy
6  10bf345ab114c20df2d1eedbbe7e7cd6b969db05      45216     5412548     Fish
7  10bf345ab114c20df2d1eedbbe7e7cd6b969db05      45125      458463      Hay
8  b0db86a39b9617cef61a8986fd57af7960eec9f4      45778     5412548    Dairy
9  b0db86a39b9617cef61a8986fd57af7960eec9f4      45126      458463     Fish
10 b0db86a39b9617cef61a8986fd57af7960eec9f4      32548     5412548      Hay
11 b0db86a39b9617cef61a8986fd57af7960eec9f4      45683     2484722    Dairy

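Am I misunderstanding how this works? If I apply the same logic to several columns, the columns that were not changed keep the same values, but in a column where one value was modified, every hash in that column changes. I also tried vectorizing the digest function to make sure my sapply call was not the problem, with the same result. Any ideas?

1 Answer:

Answer 0: (score: 0)

I think I have answered my own question... right after posting here, of course :).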

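The digest function has a serialize argument, documented as follows: "A logical variable indicating whether the object should be serialized using serialize (in ASCII form). Setting this to FALSE allows to compare the digest output of given character strings to known control output. It also allows the use of raw vectors such as the output of non-ASCII serialization."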
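A possible explanation for the behaviour in the question, offered as an assumption rather than something the post confirms: before R 4.0.0, data.frame() defaulted to stringsAsFactors = TRUE, so CustomerName would have been a factor. With serialize = TRUE, digest hashes the serialized R object, and each element of a factor carries the full set of levels. Changing one name changes the levels, and with them the serialized bytes of every element. The base R sketch below illustrates this without calling digest; the variable names are made up for the example.

```r
# Two factors that differ only in one level ("John Snow" vs "Jon Snow").
before <- factor(c("John Snow", "Daffy Duck", "Joe Farmer"))
after  <- factor(c("Jon Snow",  "Daffy Duck", "Joe Farmer"))

# sapply()/lapply() split a factor with as.list(), so each element is a
# length-one factor that still carries all of the levels.
duck_before <- as.list(before)[[2]]
duck_after  <- as.list(after)[[2]]

# The visible label is identical in both runs...
same_label <- as.character(duck_before) == as.character(duck_after)

# ...but the levels attribute is not, so the serialized objects differ.
same_levels <- identical(levels(duck_before), levels(duck_after))
same_serialized <- identical(serialize(duck_before, NULL),
                             serialize(duck_after, NULL))

# The raw bytes of the plain string are the same in both runs.
same_string_bytes <- identical(charToRaw(as.character(duck_before)),
                               charToRaw(as.character(duck_after)))

print(c(same_label = same_label,
        same_levels = same_levels,
        same_serialized = same_serialized,
        same_string_bytes = same_string_bytes))
```

If this is the cause, hashing the plain character string instead of the serialized factor element should give values that depend only on the name itself.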

Setting serialize to FALSE seems to fix the problem and gives the expected output.

For example:

test[,1] <- sapply(test[,1],digest,algo="sha1",serialize = FALSE)
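On the "decrypt" part of the question: a hash is one-way, so one option is to keep a lookup table of original values and their hashes. The following is only a minimal sketch under that assumption. mask_column and unmask_column are hypothetical helper names, not part of the digest package, and the column is converted with as.character() first so the result does not depend on how digest treats factors.

```r
library(digest)

# Hypothetical helper: hash each distinct value once and return the masked
# column together with a key table that maps hashes back to the originals.
mask_column <- function(x, algo = "sha1") {
  x <- as.character(x)          # hash the text, not a factor object
  values <- unique(x)
  hashes <- vapply(values, digest, character(1),
                   algo = algo, serialize = FALSE, USE.NAMES = FALSE)
  key <- data.frame(original = values, masked = hashes,
                    stringsAsFactors = FALSE)
  list(masked = key$masked[match(x, key$original)], key = key)
}

# Hypothetical helper: reverse the masking using a stored key table.
unmask_column <- function(masked, key) {
  key$original[match(masked, key$masked)]
}

names_run1 <- c("John Snow", "John Snow", "Daffy Duck", "Joe Farmer")
names_run2 <- c("Jon Snow",  "Jon Snow",  "Daffy Duck", "Joe Farmer")

run1 <- mask_column(names_run1)
run2 <- mask_column(names_run2)

# Unchanged names keep their hash across runs; only the edited name differs.
print(run1$masked == run2$masked)

# The stored key table recovers the original names.
print(unmask_column(run1$masked, run1$key))
```

The key table would need to be stored as carefully as the original data. Note also that an unsalted hash of a name can be guessed by hashing candidate names, so this masks the values rather than encrypting them.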