I need to spread (unpivot) this Customer data frame:
Value
Customer.CustomerID 21110001
Customer.AccountID 21110001
Customer.CustomerTaxID 123123123
Customer.CompanyName S LDA
Customer.BillingAddress.AddressDetail Desconhecido
Customer.BillingAddress.City Desconhecido
Customer.BillingAddress.PostalCode Desconhecido
Customer.BillingAddress.Country PT
Customer.ShipToAddress.AddressDetail Desconhecido
Customer.ShipToAddress.City Desconhecido
Customer.ShipToAddress.PostalCode Desconhecido
Customer.ShipToAddress.Country PT
Customer.SelfBillingIndicator 0
Customer.CustomerID.1 21110002
Customer.AccountID.1 21110002
Customer.CustomerTaxID.1 321321321
Customer.CompanyName.1 RLDA
Customer.BillingAddress.AddressDetail.1 Desconhecido
Customer.BillingAddress.City.1 Desconhecido
Customer.BillingAddress.PostalCode.1 Desconhecido
Customer.BillingAddress.Country.1 PT
Customer.ShipToAddress.AddressDetail.1 Desconhecido
Customer.ShipToAddress.City.1 Desconhecido
Customer.ShipToAddress.PostalCode.1 Desconhecido
Customer.ShipToAddress.Country.1 PT
Customer.SelfBillingIndicator.1 0
Customer.CustomerID.2 21110004
Customer.AccountID.2 21110004
Customer.CustomerTaxID.2 999999999
Customer.CompanyName.2 RTDA
Customer.BillingAddress.AddressDetail.2 Desconhecido
Customer.BillingAddress.City.2 Desconhecido
Customer.BillingAddress.PostalCode.2 Desconhecido
Customer.BillingAddress.Country.2 PT
Customer.ShipToAddress.AddressDetail.2 Desconhecido
Customer.ShipToAddress.City.2 Desconhecido
Customer.ShipToAddress.PostalCode.2 Desconhecido
Customer.ShipToAddress.Country.2 PT
Customer.SelfBillingIndicator.2 0
This is what I am trying:
Customer <- Customer %>%
rownames_to_column %>%
transmute(mycols = gsub('^.*\\.', '', gsub('.[[:digit:]]+', '', rowname)),
numlinha = regmatches(rowname, gregexpr('[0-9]+',rowname)),
value = Value) %>%
spread(key=mycols, value=value)
This returns the error:
Error: Duplicate identifiers for rows (5, 9)
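I think, as the error message suggests, this happens because the regex in gsub does not handle the row names Customer.BillingAddress.AddressDetail and Customer.ShipToAddress.AddressDetail well: both end up with the same key.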
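The collision can be reproduced in isolation. The snippet below is a minimal sketch using base R only; the vector name rn and the alternative pattern in the last step are illustrative assumptions, not part of the original attempt.

```r
# Two of the row names from the question
rn <- c("Customer.BillingAddress.AddressDetail",
        "Customer.ShipToAddress.AddressDetail")

# The pattern from the attempt: the greedy '^.*\\.' removes everything
# up to the LAST dot, so the parent element (BillingAddress / ShipToAddress) is lost
collapsed <- gsub("^.*\\.", "", gsub(".[[:digit:]]+", "", rn))
print(collapsed)

# Removing only the "Customer." prefix keeps the two keys distinct
distinct <- sub("^Customer\\.", "", rn)
print(distinct)
```

Both entries of collapsed come out as the same string, which is what spread reports as duplicate identifiers.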
The desired output is a data frame in which CustomerID, AccountID, BillingAddress_Detail, ShipToAddress_Detail, etc. are the column names.
I have spent hours looking for a better regex and still cannot find one. Can anyone help?
Edit: @Ronak Shah, this is the result I get: the first row is offset by one column.
Answer 0: (score: 1)
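It would be easier to help if you could share data for at least 2 customers. Anyway, I created a data sample for two of them based on my understanding. Since there are multiple customers and a data frame cannot have duplicate row names, I assume the row names carry a trailing number. We can remove it with gsub and create an identifier column for spread. Based on the sample data shown, I assume here that each customer has 10 fields; if your data differs, change the each argument in rep.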
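library(tidyverse)
df %>%
  rownames_to_column() %>%
  # Strip the "Customer." prefix and the trailing number (with or without a dot),
  # then number the customers in blocks of 10 consecutive rows
  mutate(rowname = gsub("Customer\\.|\\.?\\d+$", "", rowname),
         spread_row = rep(seq_len(n()), each = 10, length.out = n())) %>%
  spread(rowname, Value)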
# spread_row AccountID BillingAddress.AddressDetail BillingAddress.City BillingAddress.Country
#1 1 21110001 Desconhecido Desconhecido ZPT
#2 2 21110001 Desconhecidorobes Desconhecido ZPT
# BillingAddress.PostalCode CompanyName CustomerID CustomerTaxID ShipToAddress.AddressDetail ShipToAddress.City
#1 Desconhecido SLD 21110001 123123123 Desconhecido Desconhecido
#2 Desconhecido SLD 21110002 123123123 Desconhecido Desconhecido
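If the number of fields per customer is not fixed (the listing in the question shows 13 fields per customer, not 10), a block size hard-coded in each no longer lines up with the customers. One alternative, sketched below, is to read the customer index from the row-name suffix itself. This assumes the suffix has the form ".1", ".2", ... as in the question, with no suffix for the first customer; the miniature data frame and the column name customer are made up for illustration.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical miniature of the question's data: two customers,
# the second one marked by a ".1" suffix on every row name
Customer <- data.frame(
  Value = c("21110001", "Desconhecido", "Desconhecido",
            "21110002", "Desconhecido", "Desconhecido"),
  row.names = c("Customer.CustomerID",
                "Customer.BillingAddress.AddressDetail",
                "Customer.ShipToAddress.AddressDetail",
                "Customer.CustomerID.1",
                "Customer.BillingAddress.AddressDetail.1",
                "Customer.ShipToAddress.AddressDetail.1"),
  stringsAsFactors = FALSE
)

wide <- Customer %>%
  rownames_to_column() %>%
  mutate(
    # Customer index: the trailing ".<digits>" if present, otherwise 0
    customer = as.integer(ifelse(grepl("\\.\\d+$", rowname),
                                 sub("^.*\\.", "", rowname),
                                 "0")),
    # Field name: drop the "Customer." prefix and the trailing ".<digits>"
    rowname = gsub("^Customer\\.|\\.\\d+$", "", rowname)
  ) %>%
  spread(rowname, Value)

print(wide)
```

Because the grouping key comes from the row name rather than from the row position, a customer with missing or extra fields would get NA in the affected columns instead of shifting the rows that follow.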
Data
df <- structure(list(Value = c("21110001", "21110001", "123123123",
"SLD", "Desconhecido", "Desconhecido", "Desconhecido", "ZPT",
"Desconhecido", "Desconhecido", "21110002", "21110001", "123123123",
"SLD", "Desconhecidorobes", "Desconhecido", "Desconhecido", "ZPT",
"Desconhecido", "Desconhecido")), row.names = c("Customer.CustomerID",
"Customer.AccountID", "Customer.CustomerTaxID", "Customer.CompanyName",
"Customer.BillingAddress.AddressDetail", "Customer.BillingAddress.City",
"Customer.BillingAddress.PostalCode", "Customer.BillingAddress.Country",
"Customer.ShipToAddress.AddressDetail", "Customer.ShipToAddress.City",
"Customer.CustomerID1", "Customer.AccountID1", "Customer.CustomerTaxID1",
"Customer.CompanyName1", "Customer.BillingAddress.AddressDetail1",
"Customer.BillingAddress.City1", "Customer.BillingAddress.PostalCode1",
"Customer.BillingAddress.Country1", "Customer.ShipToAddress.AddressDetail1",
"Customer.ShipToAddress.City1"), class = "data.frame")