Regular expression error in an R data frame

Time: 2019-08-11 22:47:32

Tags: r regex tidyr

I need to spread (unpivot) this Customer data frame:

                                                                  Value
Customer.CustomerID                                            21110001
Customer.AccountID                                             21110001
Customer.CustomerTaxID                                        123123123
Customer.CompanyName                                              S LDA
Customer.BillingAddress.AddressDetail                      Desconhecido
Customer.BillingAddress.City                               Desconhecido
Customer.BillingAddress.PostalCode                         Desconhecido
Customer.BillingAddress.Country                                      PT
Customer.ShipToAddress.AddressDetail                       Desconhecido
Customer.ShipToAddress.City                                Desconhecido
Customer.ShipToAddress.PostalCode                          Desconhecido
Customer.ShipToAddress.Country                                       PT
Customer.SelfBillingIndicator                                         0
Customer.CustomerID.1                                          21110002
Customer.AccountID.1                                           21110002
Customer.CustomerTaxID.1                                      321321321
Customer.CompanyName.1                                             RLDA
Customer.BillingAddress.AddressDetail.1                    Desconhecido
Customer.BillingAddress.City.1                             Desconhecido
Customer.BillingAddress.PostalCode.1                       Desconhecido
Customer.BillingAddress.Country.1                                    PT
Customer.ShipToAddress.AddressDetail.1                     Desconhecido
Customer.ShipToAddress.City.1                              Desconhecido
Customer.ShipToAddress.PostalCode.1                        Desconhecido
Customer.ShipToAddress.Country.1                                     PT
Customer.SelfBillingIndicator.1                                       0
Customer.CustomerID.2                                          21110004
Customer.AccountID.2                                           21110004
Customer.CustomerTaxID.2                                      999999999
Customer.CompanyName.2                                             RTDA
Customer.BillingAddress.AddressDetail.2                    Desconhecido
Customer.BillingAddress.City.2                             Desconhecido
Customer.BillingAddress.PostalCode.2                       Desconhecido
Customer.BillingAddress.Country.2                                    PT
Customer.ShipToAddress.AddressDetail.2                     Desconhecido
Customer.ShipToAddress.City.2                              Desconhecido
Customer.ShipToAddress.PostalCode.2                        Desconhecido
Customer.ShipToAddress.Country.2                                     PT
Customer.SelfBillingIndicator.2                                       0

This is what I am trying:

Customer <- Customer %>% 
  rownames_to_column %>% 
  transmute(mycols = gsub('^.*\\.', '', gsub('.[[:digit:]]+', '', rowname)),
            numlinha = regmatches(rowname, gregexpr('[0-9]+',rowname)),
            value = Value) %>% 
  spread(key=mycols, value=value) 

This returns the error:

Error: Duplicate identifiers for rows (5, 9)

I think this happens because, as the error message indicates, the regex in gsub does not handle the row names Customer.BillingAddress.AddressDetail and Customer.ShipToAddress.AddressDetail well.
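A minimal sketch in base R of what the two nested gsub calls appear to do to those row names. The three-element vector `rn` is a hypothetical excerpt of the row names, and the last pattern is only one possible alternative, not necessarily the fix the asker needs:

```r
# Hypothetical excerpt of the row names from the question
rn <- c("Customer.BillingAddress.AddressDetail",
        "Customer.ShipToAddress.AddressDetail",
        "Customer.ShipToAddress.AddressDetail.1")

# The question's pattern: '^.*\\.' is greedy, so it removes everything up to
# the LAST dot, which also drops the BillingAddress/ShipToAddress part
collapsed <- gsub('^.*\\.', '', gsub('.[[:digit:]]+', '', rn))
print(collapsed)  # every element is "AddressDetail", hence duplicate keys

# One alternative: strip only the "Customer." prefix and the numeric suffix
kept <- gsub("^Customer\\.|\\.[0-9]+$", "", rn)
print(kept)

# regmatches() with gregexpr() returns a list; rows without any digit
# yield character(0), so the result is not a plain id vector
matches <- regmatches(rn, gregexpr("[0-9]+", rn))
print(lengths(matches))
```

Note that the unescaped `.` in `'.[[:digit:]]+'` matches any character, not just a literal dot, so it would also eat the character before a digit that sits inside a field name.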

The desired output is a data frame in which CustomerID, AccountID, BillingAddress_Detail, ShipToAddress_Detail, etc. are the column names.

I have spent hours looking for a better regex and still cannot find one. Can anyone help?

Edit: @Ronak Shah, this is the result I get: the first row is shifted by one column.


1 answer:

Answer 0: (score: 1)

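It would be easier to help if you had provided data for at least two customers. Anyway, I created a data sample for two of them based on my understanding. Because there are multiple customers and a data frame cannot have duplicate row names, I assume the row names carry a trailing number. We can remove it with gsub and create an identifier column for spread. Based on the sample data shown, I assume each customer has 10 fields; if that differs, change the each argument in rep.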

library(tidyverse)

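df %>%
  rownames_to_column() %>%
  # Drop the "Customer." prefix and the trailing customer number;
  # the dot before the number is optional, so "ID1" and "ID.1" both work
  mutate(rowname = gsub("Customer\\.|\\.?\\d+$", "", rowname),
         # Customer identifier: 1 for the first 10 rows, 2 for the next 10, ...
         spread_row = rep(seq_len(n()), each = 10, length.out = n())) %>%
  spread(rowname, Value)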

#  spread_row AccountID BillingAddress.AddressDetail BillingAddress.City BillingAddress.Country
#1          1  21110001                 Desconhecido        Desconhecido                    ZPT
#2          2  21110001            Desconhecidorobes        Desconhecido                    ZPT

#  BillingAddress.PostalCode CompanyName CustomerID CustomerTaxID ShipToAddress.AddressDetail ShipToAddress.City
#1              Desconhecido         SLD   21110001     123123123                Desconhecido       Desconhecido
#2              Desconhecido         SLD   21110002     123123123                Desconhecido       Desconhecido
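If the number of fields per customer is not fixed, the identifier can instead be derived from the numeric suffix itself. The following is a sketch under assumptions, not the answer's method: the four-field `Customer` data frame is a hypothetical miniature of the question's layout (suffix written as `.1`), rows without a suffix are assigned customer 0, and `pivot_wider` (tidyr 1.0.0 or later) stands in for `spread`.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical mini data set: two customers, the second one marked by ".1"
fields <- c("CustomerID", "CompanyName",
            "BillingAddress.AddressDetail", "ShipToAddress.AddressDetail")
Customer <- data.frame(
  Value = c("21110001", "S LDA", "Desconhecido", "Desconhecido",
            "21110002", "RLDA", "Desconhecido", "Desconhecido"),
  row.names = c(paste0("Customer.", fields),
                paste0("Customer.", fields, ".1")),
  stringsAsFactors = FALSE
)

wide <- Customer %>%
  rownames_to_column() %>%
  mutate(
    # Customer index taken from the numeric suffix; no suffix means 0
    customer = as.integer(ifelse(grepl("\\.[0-9]+$", rowname),
                                 sub("^.*\\.([0-9]+)$", "\\1", rowname),
                                 "0")),
    # Field name without the "Customer." prefix and without the suffix
    field = gsub("^Customer\\.|\\.[0-9]+$", "", rowname)
  ) %>%
  select(customer, field, Value) %>%
  pivot_wider(names_from = field, values_from = Value)

print(wide)
```

Because the customer index comes from the row name, this variant does not depend on every customer having the same number of fields; a field missing for one customer would simply show up as NA in that row.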

Data

df <- structure(list(Value = c("21110001", "21110001", "123123123", 
"SLD", "Desconhecido", "Desconhecido", "Desconhecido", "ZPT", 
"Desconhecido", "Desconhecido", "21110002", "21110001", "123123123", 
"SLD", "Desconhecidorobes", "Desconhecido", "Desconhecido", "ZPT", 
"Desconhecido", "Desconhecido")), row.names = c("Customer.CustomerID", 
"Customer.AccountID", "Customer.CustomerTaxID", "Customer.CompanyName", 
"Customer.BillingAddress.AddressDetail", "Customer.BillingAddress.City", 
"Customer.BillingAddress.PostalCode", "Customer.BillingAddress.Country", 
"Customer.ShipToAddress.AddressDetail", "Customer.ShipToAddress.City", 
"Customer.CustomerID1", "Customer.AccountID1", "Customer.CustomerTaxID1", 
"Customer.CompanyName1", "Customer.BillingAddress.AddressDetail1", 
"Customer.BillingAddress.City1", "Customer.BillingAddress.PostalCode1", 
"Customer.BillingAddress.Country1", "Customer.ShipToAddress.AddressDetail1", 
"Customer.ShipToAddress.City1"), class = "data.frame")