我有以下数据,如下所示:
COLUPC UPC
19638 715643501208 00-01-82000-72608
11783 15230000022 00-01-01820-00198
17140 18769210012 07-01-93271-20600
19624 711925600019 00-01-86243-11167
18706 28382012393 27-01-15503-01673
8716 13410025574 00-01-78250-00236
列COLUPC
和UPC
来自两个不同的数据帧,并对每个数据帧进行了随机采样。 COLUPC
和UPC
是相同的唯一标识符。只是COLUPC
失去了格式。
取COLUPC
-715643501208
的第一行值。它看起来像07-01-56435-01208
,其结构与列UPC
中的结构相同。
通过观察15230000022
会变得有些棘手,看起来应该像01-05-23000-00022
,在第四个组件0
中除了组件1之外还添加了额外的00022
2。
数据文件随附的文档指出COLUPC
没有显示前导零。
所以;
组件1是2位数字
组件2是2位数字(1位数字开头是0)
第3部分是5位数字
组件4是5位数字
从00-01-01820-00198
栏中获取此数字UPC
。在COLUPC
列中,该字段将折叠为11820198
,其中所有前导零都已从组件1、2、3和4中删除。
如果我不清楚,请告诉我。
数据:
structure(list(COLUPC = c(715643501208, 15230000022, 18769210012,
711925600019, 28382012393, 13410025574, 716708427455, 12210000157,
13410025602, 17143945712, 17336011341, 13410062505, 11820053218,
11820002989, 13410059505, 11820043550, 18417331130, 711820053025,
18982650001, 18248813000, 17199077603, 18834510005, 13410048602,
18417310545, 11820000987, 13410000217, 17089701331, 13410017256,
618516510302, 23410030602, 17336080301, 613498630003, 17825000257,
16206705155, 17031000013, 13410000555, 16206704972, 37031030703,
27336082301, 17031023882, 18769227102, 18382056793, 31820000697,
13410048574, 17199048004, 11820041550, 11820000884, 18572511509,
14182700012, 13410010505, 18700050898, 18248812910, 18015761147,
18549873837, 16821309117, 618516510402, 17231163009, 611969900006,
11820011349, 11820000769, 13410046805, 13410000162, 13410000579,
18374110012, 21820000801, 17231163011, 23410068505, 17199030030,
11820000771, 17031000009, 28549873736, 11820000784, 17199048016,
13410057256, 11820054028, 11820001991, 13410057602, 17336080341,
21820029031, 715643501210, 17231163012, 17199047004, 18769201103,
13410000623, 15230000540, 618516510704, 17336011348, 15230001301,
618516510707, 17143934806, 18572511510, 28248800006, 13410057536,
611969900002, 11820041047, 13410057340, 18769241103, 17199000025,
21820000803, 14182700712), UPC = c("00-01-82000-72608", "00-01-01820-00198",
"07-01-93271-20600", "00-01-86243-11167", "27-01-15503-01673",
"00-01-78250-00236", "00-01-18200-96664", "00-01-99232-00055",
"27-01-15504-02696", "00-01-34100-15306", "00-01-95301-13152",
"00-03-22405-55555", "00-02-76398-50000", "00-01-82000-72750",
"00-01-88573-50001", "00-01-72890-00122", "00-02-18200-00406",
"00-03-18200-01168", "00-01-83820-73500", "27-01-00007-13019",
"00-01-82548-20702", "07-01-19283-60388", "07-01-14834-00121",
"07-01-02596-10001", "27-01-15503-03400", "06-01-30279-33706",
"27-01-04200-18188", "07-01-80369-61992", "00-01-82153-19309",
"00-01-34100-59205", "27-01-15504-00652", "00-02-18200-86030",
"07-01-08820-49939", "27-01-04203-24019", "27-01-15503-00627",
"27-01-15504-01210", "27-01-15504-01305", "00-02-18200-00006",
"00-01-85498-66662", "00-01-50939-12201", "00-01-95374-70000",
"00-01-81986-00187", "00-01-72620-20101", "07-02-19256-00043",
"00-01-68213-05117", "07-01-86243-01106", "00-03-21242-00004",
"27-01-15503-00969", "00-01-62067-38055", "07-01-36040-00657",
"00-01-78250-00388", "00-01-62067-04971", "00-01-18200-15047",
"27-01-00001-60539", "00-01-70378-54652", "00-01-52300-00005",
"27-01-15504-01620", "00-01-70310-35738", "07-03-36920-11141",
"27-01-15503-00049", "00-01-80480-90593", "00-01-18200-27168",
"00-01-34100-14574", "07-01-92098-07234", "27-01-00007-05978",
"06-01-05748-02000", "00-01-34100-92528", "06-01-43775-10306",
"27-01-15503-01146", "27-01-00007-06142", "07-03-02770-00802",
"00-04-62067-00900", "27-01-15502-01229", "00-01-73832-06340",
"00-01-18200-00180", "07-01-94028-20030", "27-01-15504-02559",
"07-01-90586-00122", "00-01-92212-11102", "07-01-80369-91993",
"00-01-18200-18500", "00-01-18200-00997", "00-01-18200-41372",
"00-01-34100-50505", "00-01-92212-12111", "27-01-15503-00078",
"27-01-15503-07766", "00-01-80069-00221", "27-01-15504-03101",
"00-02-77940-00002", "00-01-02770-01302", "07-01-27658-12302",
"00-01-34100-62152", "00-01-72620-20128", "00-01-71990-77952",
"00-01-85725-21004", "07-01-08820-58624", "07-02-88946-10141",
"00-01-81986-00144", "00-01-18200-00242")), class = "data.frame", row.names = c(19638L,
11783L, 17140L, 19624L, 18706L, 8716L, 19653L, 5259L, 8738L,
13844L, 15570L, 11543L, 3919L, 1345L, 11412L, 3122L, 16899L,
19583L, 17617L, 16753L, 15011L, 17582L, 9364L, 16889L, 1310L,
5602L, 13696L, 8276L, 19528L, 18189L, 15840L, 19217L, 15982L,
13285L, 13568L, 5737L, 13263L, 19052L, 18282L, 13613L, 17300L,
16882L, 18785L, 9214L, 14925L, 2868L, 1009L, 17053L, 11674L,
7364L, 17077L, 16752L, 15990L, 17038L, 13449L, 19530L, 15166L,
19167L, 2539L, 542L, 9054L, 5597L, 5757L, 16754L, 17682L, 15167L,
18203L, 14441L, 657L, 13551L, 18748L, 747L, 14953L, 9903L, 4182L,
1330L, 11094L, 15936L, 17768L, 19642L, 15173L, 14921L, 17090L,
5759L, 12765L, 19579L, 15588L, 12885L, 19581L, 13814L, 17054L,
18697L, 11085L, 19136L, 2764L, 10523L, 17465L, 13960L, 17724L,
11687L))
答案 0 :(得分:1)
我认为您可以使用基于gsub
的两个nchar
语句来完成此操作。我在这里假设,对于COLUPC
中的11位数字值,您总是在第四组上总是缺少零。我在df1
通话中呼叫了您的数据框sapply
。如果您使用其他名称命名,则需要交换名称。
我们将定义一个辅助函数以传递给sapply
:
fixUPC <- function(x){
if (nchar(x) == 11) {
# we catch each Component in a separate group, then reconstruct in
# our substitution adding zeros and dashes as necessary
gsub("^(\\d)(\\d)(\\d{5})(\\d{4})","0\\1-0\\2-\\3-0\\4",x)
} else if (nchar(x) == 12){
gsub("^(\\d)(\\d)(\\d{5})(\\d{5})","0\\1-0\\2-\\3-\\4",x)
}
}
df1$fixedUPC <- sapply(df1$COLUPC, fixUPC)
head(df1$fixedUPC)
[1] "07-01-56435-01208" "01-05-23000-00022" "01-08-76921-00012"
[4] "07-01-19256-00019" "02-08-38201-02393" "01-03-41002-05574"
您还可以将其扩展到丢失了3个以上字符(即nchar == 10
)的情况,但是,随着丢失更多数字,您能否准确填入零将取决于UPC开始的一致性用。对于您的最后一个示例11820198
,原始的UPC是00-01-01820-00198
还是01-01-82019-00008
变得模棱两可。如果您可以肯定地说零总是在某个nchar
值上出现一个位置或另一个位置,那您就很好了,但是在整个数据集中可能并非如此。