在标识列中添加尾随零

时间:2018-09-20 17:21:15

标签: r

我有以下数据,如下所示:

            COLUPC               UPC
19638 715643501208 00-01-82000-72608
11783  15230000022 00-01-01820-00198
17140  18769210012 07-01-93271-20600
19624 711925600019 00-01-86243-11167
18706  28382012393 27-01-15503-01673
8716   13410025574 00-01-78250-00236

COLUPCUPC来自两个不同的数据帧,并对每个数据帧进行了随机采样。 COLUPCUPC是相同的唯一标识符。只是COLUPC失去了格式。

COLUPC-715643501208的第一行值。它看起来像07-01-56435-01208,其结构与列UPC中的结构相同。

通过观察15230000022会变得有些棘手,看起来应该像01-05-23000-00022,在第四个组件0中除了组件1之外还添加了额外的00022 2。

数据文件随附的文档指出COLUPC没有显示前导零。

所以;

组件1是2位数字

组件2是2位数字(1位数字开头是0)

第3部分是5位数字

组件4是5位数字

00-01-01820-00198栏中获取此数字UPC。在COLUPC列中,该字段将折叠为11820198,其中所有前导零都已从组件1、2、3和4中删除。

如果我不清楚,请告诉我。

数据:

structure(list(COLUPC = c(715643501208, 15230000022, 18769210012, 
711925600019, 28382012393, 13410025574, 716708427455, 12210000157, 
13410025602, 17143945712, 17336011341, 13410062505, 11820053218, 
11820002989, 13410059505, 11820043550, 18417331130, 711820053025, 
18982650001, 18248813000, 17199077603, 18834510005, 13410048602, 
18417310545, 11820000987, 13410000217, 17089701331, 13410017256, 
618516510302, 23410030602, 17336080301, 613498630003, 17825000257, 
16206705155, 17031000013, 13410000555, 16206704972, 37031030703, 
27336082301, 17031023882, 18769227102, 18382056793, 31820000697, 
13410048574, 17199048004, 11820041550, 11820000884, 18572511509, 
14182700012, 13410010505, 18700050898, 18248812910, 18015761147, 
18549873837, 16821309117, 618516510402, 17231163009, 611969900006, 
11820011349, 11820000769, 13410046805, 13410000162, 13410000579, 
18374110012, 21820000801, 17231163011, 23410068505, 17199030030, 
11820000771, 17031000009, 28549873736, 11820000784, 17199048016, 
13410057256, 11820054028, 11820001991, 13410057602, 17336080341, 
21820029031, 715643501210, 17231163012, 17199047004, 18769201103, 
13410000623, 15230000540, 618516510704, 17336011348, 15230001301, 
618516510707, 17143934806, 18572511510, 28248800006, 13410057536, 
611969900002, 11820041047, 13410057340, 18769241103, 17199000025, 
21820000803, 14182700712), UPC = c("00-01-82000-72608", "00-01-01820-00198", 
"07-01-93271-20600", "00-01-86243-11167", "27-01-15503-01673", 
"00-01-78250-00236", "00-01-18200-96664", "00-01-99232-00055", 
"27-01-15504-02696", "00-01-34100-15306", "00-01-95301-13152", 
"00-03-22405-55555", "00-02-76398-50000", "00-01-82000-72750", 
"00-01-88573-50001", "00-01-72890-00122", "00-02-18200-00406", 
"00-03-18200-01168", "00-01-83820-73500", "27-01-00007-13019", 
"00-01-82548-20702", "07-01-19283-60388", "07-01-14834-00121", 
"07-01-02596-10001", "27-01-15503-03400", "06-01-30279-33706", 
"27-01-04200-18188", "07-01-80369-61992", "00-01-82153-19309", 
"00-01-34100-59205", "27-01-15504-00652", "00-02-18200-86030", 
"07-01-08820-49939", "27-01-04203-24019", "27-01-15503-00627", 
"27-01-15504-01210", "27-01-15504-01305", "00-02-18200-00006", 
"00-01-85498-66662", "00-01-50939-12201", "00-01-95374-70000", 
"00-01-81986-00187", "00-01-72620-20101", "07-02-19256-00043", 
"00-01-68213-05117", "07-01-86243-01106", "00-03-21242-00004", 
"27-01-15503-00969", "00-01-62067-38055", "07-01-36040-00657", 
"00-01-78250-00388", "00-01-62067-04971", "00-01-18200-15047", 
"27-01-00001-60539", "00-01-70378-54652", "00-01-52300-00005", 
"27-01-15504-01620", "00-01-70310-35738", "07-03-36920-11141", 
"27-01-15503-00049", "00-01-80480-90593", "00-01-18200-27168", 
"00-01-34100-14574", "07-01-92098-07234", "27-01-00007-05978", 
"06-01-05748-02000", "00-01-34100-92528", "06-01-43775-10306", 
"27-01-15503-01146", "27-01-00007-06142", "07-03-02770-00802", 
"00-04-62067-00900", "27-01-15502-01229", "00-01-73832-06340", 
"00-01-18200-00180", "07-01-94028-20030", "27-01-15504-02559", 
"07-01-90586-00122", "00-01-92212-11102", "07-01-80369-91993", 
"00-01-18200-18500", "00-01-18200-00997", "00-01-18200-41372", 
"00-01-34100-50505", "00-01-92212-12111", "27-01-15503-00078", 
"27-01-15503-07766", "00-01-80069-00221", "27-01-15504-03101", 
"00-02-77940-00002", "00-01-02770-01302", "07-01-27658-12302", 
"00-01-34100-62152", "00-01-72620-20128", "00-01-71990-77952", 
"00-01-85725-21004", "07-01-08820-58624", "07-02-88946-10141", 
"00-01-81986-00144", "00-01-18200-00242")), class = "data.frame", row.names = c(19638L, 
11783L, 17140L, 19624L, 18706L, 8716L, 19653L, 5259L, 8738L, 
13844L, 15570L, 11543L, 3919L, 1345L, 11412L, 3122L, 16899L, 
19583L, 17617L, 16753L, 15011L, 17582L, 9364L, 16889L, 1310L, 
5602L, 13696L, 8276L, 19528L, 18189L, 15840L, 19217L, 15982L, 
13285L, 13568L, 5737L, 13263L, 19052L, 18282L, 13613L, 17300L, 
16882L, 18785L, 9214L, 14925L, 2868L, 1009L, 17053L, 11674L, 
7364L, 17077L, 16752L, 15990L, 17038L, 13449L, 19530L, 15166L, 
19167L, 2539L, 542L, 9054L, 5597L, 5757L, 16754L, 17682L, 15167L, 
18203L, 14441L, 657L, 13551L, 18748L, 747L, 14953L, 9903L, 4182L, 
1330L, 11094L, 15936L, 17768L, 19642L, 15173L, 14921L, 17090L, 
5759L, 12765L, 19579L, 15588L, 12885L, 19581L, 13814L, 17054L, 
18697L, 11085L, 19136L, 2764L, 10523L, 17465L, 13960L, 17724L, 
11687L))

1 个答案:

答案 0 :(得分:1)

我认为您可以使用基于gsub的两个nchar语句来完成此操作。我在这里假设,对于COLUPC中的11位数字值,您总是在第四组上总是缺少零。我在df1通话中呼叫了您的数据框sapply。如果您使用其他名称命名,则需要交换名称。

我们将定义一个辅助函数以传递给sapply

fixUPC <- function(x){
    if (nchar(x) == 11) {
        # we catch each Component in a separate group, then reconstruct in 
        # our substitution adding zeros and dashes as necessary
        gsub("^(\\d)(\\d)(\\d{5})(\\d{4})","0\\1-0\\2-\\3-0\\4",x)
    } else if (nchar(x) == 12){
        gsub("^(\\d)(\\d)(\\d{5})(\\d{5})","0\\1-0\\2-\\3-\\4",x)
    }
}

df1$fixedUPC <- sapply(df1$COLUPC, fixUPC)

head(df1$fixedUPC)

[1] "07-01-56435-01208" "01-05-23000-00022" "01-08-76921-00012"
[4] "07-01-19256-00019" "02-08-38201-02393" "01-03-41002-05574"

您还可以将其扩展到丢失了3个以上字符(即nchar == 10)的情况,但是,随着丢失更多数字,您能否准确填入零将取决于UPC开始的一致性用。对于您的最后一个示例11820198,原始的UPC是00-01-01820-00198还是01-01-82019-00008变得模棱两可。如果您可以肯定地说零总是在某个nchar值上出现一个位置或另一个位置,那您就很好了,但是在整个数据集中可能并非如此。