Reshape repeated rows from long to wide

Time: 2017-08-30 16:54:08

Tags: r dataframe transform reshape reshape2

My data frame df looks like this:

Label              Info
1  0-22 Records    N/A 
2  0-22 Records    Poland
3  0-22 Records    N/A 
4  0-22 Records    active
5  0-22 Records    Hardcore 
6  0-22 Records    N/A 
7  0-22 Records    N/A
8  Nuclear Blast   "Oeschstr. 40 73072 Donzdorf"
9  Nuclear Blast   Germany
10 Nuclear Blast   +49 7162 9280-0 
11 Nuclear Blast   active
12 Nuclear Blast   Hardcore (early), Metal and subgenres 
13 Nuclear Blast   1987
14 Nuclear Blast   "Anstalt Records, Arctic Serenades, Cannibalised Serial Killer, Deathwish Office, Epica, Gore Records, Grind Syndicate Media,                                  Mind Control Records, Nuclear Blast America, Nuclear Blast Brasil,                                  Nuclear Blast Entertainment, Radiation Records, Revolution Entertainment"
15 Nuclear Blast   Yes

I want to reshape df so that it looks like this:

  Label         Address    Country      Phone      Status       Genre      Year      Sub        Online
1 0-22 Records  N/A        Poland       N/A        active       Hardcore   N/A       N/A        N/A
2 Nuclear Blast "Oes.."    Germany      +49...
   .
   .

The number of repeated rows per label varies from 7 to 9. I have tried reshape and reshape2, assigning "Label" as the key, to no avail.

Edit: dput

structure(list(label = c("0-22 Records", "0-22 Records", "0-22 Records", 
 "0-22 Records", "0-22 Records", "0-22 Records", "0-22 Records", 
 "Nuclear Blast", "Nuclear Blast", "Nuclear Blast", "Nuclear Blast", 
 "Nuclear Blast", "Nuclear Blast", "Nuclear Blast", "Nuclear Blast", 
 "Metal Blade Records", "Metal Blade Records", "Metal Blade Records", 
 "Metal Blade Records", "Metal Blade Records"), info = c(" N/A ", 
 "Poland", " N/A ", "active", " Hardcore ", " N/A ", "N/A", " Oeschstr. 
 40\r\n73072 Donzdorf ", 
 "Germany", " +49 7162 9280-0 ", "active", " Hardcore (early), Metal and 
 subgenres ", " 1987 ", "\n\t\t\t\t\t\t\t\t\tAnstalt 
 Records,\t\t\t\t\t\t\t\t\tArctic Serenades,\t\t\t\t\t\t\t\t\tCannibalised 
 Serial Killer,\t\t\t\t\t\t\t\t\tDeathwish 
 Office,\t\t\t\t\t\t\t\t\tEpica,\t\t\t\t\t\t\t\t\tGore 
 Records,\t\t\t\t\t\t\t\t\tGrind Syndicate Media,\t\t\t\t\t\t\t\t\tMind 
 Control Records,\t\t\t\t\t\t\t\t\tNuclear Blast 
 America,\t\t\t\t\t\t\t\t\tNuclear Blast Brasil,\t\t\t\t\t\t\t\t\tNuclear 
 Blast Entertainment,\t\t\t\t\t\t\t\t\tRadiation 
 Records,\t\t\t\t\t\t\t\t\tRevolution Entertainment\t\t\t\t\t      ", 
 "Yes", " 5737 Kanan Road #143\r\nAgoura Hills, California 91301 ", 
 "United States", " N/A ", "active", " Heavy Metal, Extreme Metal "
 )), .Names = c("label", "info"), class = c("data.table", "data.frame"
 ), row.names = c(NA, -20L), .internal.selfref = <pointer: 0x10200db78>)

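1 answer:

Answer 0: (score: 1)

The new column names of the wide data frame (e.g., Address, Country, etc.) do not appear in df. We need to add a column to df that maps each info value to the correct column name of the wide data frame, so that each row's data ends up in the right column after reshaping.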
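To see why the extra column matters, here is a minimal sketch on made-up data. The `toy` data frame and its hand-written `key` column are assumptions for illustration only; in the real data, `key` is exactly what has to be inferred from `info`.

```r
library(tidyr)

# Toy long-format data; the `key` column is hypothetical and written by hand
toy <- data.frame(
  label = c("A", "A", "B", "B"),
  key   = c("Country", "Status", "Country", "Status"),
  info  = c("Poland", "active", "Germany", "active"),
  stringsAsFactors = FALSE
)

# Each distinct value of `key` becomes one column of the wide data frame
wide <- spread(toy, key, info)
print(wide)
```

Once such a key column exists, the reshape itself is a single call; the hard part is building the key.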

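The challenge is to find ways to exploit regularities in the data to determine which values of info represent Genre, Country, Year, and so on. Based on the data sample you provided, here are some initial ideas. In the code below, the case_when statement attempts to map info to the new column names. In order, the conditions inside case_when try to do the following:

  • Find Country by checking each string against a list of country names.
  • Find Status (assuming it can only be "active" or "inactive").
  • Find Genre. Here you will need to cover more possibilities.
  • Find Year. I assume that any row consisting of a four-digit number in the range 1950-2017 represents a year. Adjust as needed.
  • Find Phone. I assumed it always starts with +, so you may need something more sophisticated.
  • Find Online (assuming it can only be "Yes" or "No", and that rows that belong in other columns never consist solely of the word "Yes" or "No").
  • Find Sub. You will probably need a more sophisticated strategy. For now, I assume that rows containing "Records" or "Entertainment", or rows with three or more commas, are Sub rows.
  • If a row matches none of the conditions above, assume it is an address.

You will need to play around with these rules and see what works in the context of your data.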
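One way to experiment with the rules above is to try them on a handful of strings before touching the full data. The following is a base-R sketch of the same first-match-wins logic, not the answer's exact code: `classify_info()` and `known_countries` are hypothetical names, the country check is a simple exact match against a hard-coded vector instead of the countrycode table, and commas are counted explicitly.

```r
# Hypothetical stand-in for a real list of country names
known_countries <- c("Poland", "Germany", "United States")

# classify_info() is an illustrative helper, not part of any package
classify_info <- function(info) {
  info <- trimws(info)
  n_commas <- lengths(regmatches(info, gregexpr(",", info, fixed = TRUE)))
  # Rules listed in priority order (highest priority first)
  rules <- list(
    Country = info %in% known_countries,
    Status  = grepl("active", info),
    Genre   = grepl("hardcore|metal|rock|classical", info, ignore.case = TRUE),
    Year    = info %in% as.character(1950:2017),
    Phone   = grepl("^\\+", info),
    Online  = grepl("^(Yes|No)$", info),
    Sub     = grepl("Records|Entertainment", info) | n_commas >= 3
  )
  out <- rep("Address", length(info))   # fallback when nothing matches
  # Apply rules in reverse so that higher-priority rules overwrite lower ones
  for (name in rev(names(rules))) out[rules[[name]]] <- name
  out
}

samples <- c("Poland", "active", " Hardcore ", " 1987 ", " +49 7162 9280-0 ",
             "Yes", "Anstalt Records, Arctic Serenades",
             "Oeschstr. 40 73072 Donzdorf")
result <- classify_info(samples)
print(data.frame(info = samples, NewCols = result))
```

Printing the classification next to the input makes it easy to spot strings that land in the wrong column and to reorder or tighten the rules accordingly.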

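library(stringr)
library(tidyverse)
library(countrycode)
# Note: newer countrycode releases appear to expose this table as `codelist`;
# adjust the two references below if `countrycode_data` is not found.
data("countrycode_data")

df %>% 
  filter(!grepl("N/A", info)) %>%   # drop placeholder rows
  # Strip carriage returns, tabs, newlines and runs of spaces, then trim
  mutate(info = str_trim(gsub("\r*\t*|\n*| {2,}", "", info)),
         # Conditions are tried in order; the first match wins.
         # fixed=TRUE treats info as a literal string, not a regular expression
         NewCols = case_when(sapply(info, function(x) any(grepl(x, countrycode_data$country.name.en, fixed=TRUE))) ~ "Country",  
                             grepl("active", info) ~ "Status",                                                         
                             grepl("hardcore|metal|rock|classical", info, ignore.case=TRUE) ~ "Genre",
                             info %in% 1950:2017 ~ "Year",
                             grepl("^\\+", info) ~ "Phone",
                             grepl("^Yes$|^No$", info) ~ "Online",
                             # (,.*){3,} matches three or more commas anywhere in the string
                             grepl("Records|Entertainment|(,.*){3,}", info) ~ "Sub",
                             TRUE ~ "Address")) %>% 
  group_by(label) %>% 
  spread(NewCols, info)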

Here is the output (I have truncated the long Sub value to save space):

                label                                            Address       Country                                 Genre Online           Phone Status            Sub  Year
                <chr>                                              <chr>         <chr>                                 <chr>  <chr>           <chr>  <chr>          <chr> <chr>
1        0-22 Records                                               <NA>        Poland                              Hardcore   <NA>            <NA> active             NA  <NA>
2 Metal Blade Records 5737 Kanan Road #143Agoura Hills, California 91301 United States            Heavy Metal, Extreme Metal   <NA>            <NA> active             NA  <NA>
3       Nuclear Blast                         Oeschstr. 4073072 Donzdorf       Germany Hardcore (early), Metal and subgenres    Yes +49 7162 9280-0 active  Anstalt Re...  1987
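A practical caveat with heuristic rules like these: if two rows of the same label are assigned the same column name, spread() cannot place both values in one cell and stops with an error about duplicate identifiers. The sketch below, on made-up data, shows one way to list such clashes before reshaping; the `classified` data frame and its values are assumptions for illustration.

```r
# Toy result of the classification step; values are made up for illustration.
# The second row is an address misclassified as Genre because it contains "Metal".
classified <- data.frame(
  label   = c("A", "A", "A", "B"),
  NewCols = c("Genre", "Genre", "Country", "Country"),
  info    = c("Heavy Metal", "Metal Hammer Street 1", "Poland", "Germany"),
  stringsAsFactors = FALSE
)

# Count rows per (label, NewCols) pair; any count above 1 is a clash
counts <- aggregate(info ~ label + NewCols, data = classified, FUN = length)
names(counts)[3] <- "n"
clashes <- counts[counts$n > 1, ]
print(clashes)
```

Any pair listed in `clashes` points at a rule that is too broad for the data at hand.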

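Original answer (before the data sample was available)

If every Label has nine rows, and the type of data in each row is always in the same order for every Label, then one solution would be:

library(tidyverse)

df.wide = df %>% 
  group_by(Label) %>% 
  # The name vector needs exactly one entry per row of each Label group.
  # Only the eight names from the desired output are listed here, so add
  # the missing name if your groups really have nine rows.
  mutate(NewCols = c("Address","Country","Phone","Status","Genre","Year","Sub","Online")) %>% 
  spread(NewCols, Info)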

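You can apply this in your real data to any Label level that has 9 rows.

df.wide9 = df %>% 
  group_by(Label) %>% 
  filter(n()==9) %>%   # keep only the Label groups with nine rows
  # As above, the name vector must have one entry per row (nine here);
  # only eight names are listed, so extend it to match your data.
  mutate(NewCols = c("Address","Country","Phone","Status","Genre","Year","Sub","Online")) %>% 
  spread(NewCols, Info)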

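For Label levels with 8 or 7 rows, if the missing row always represents the same type of data (say, the Address row is always the one missing in the 8-row Label levels), then you can do the following (again assuming the data types are in the same order for each Label):

df.wide8 = df %>% 
  group_by(Label) %>% 
  filter(n()==8) %>%   # keep only the Label groups with eight rows
  # Address is dropped from the name vector; it must still have one entry
  # per row (eight here), so extend the seven names listed to match your data.
  mutate(NewCols = c("Country","Phone","Status","Genre","Year","Sub","Online")) %>% 
  spread(NewCols, Info)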

You can then put them together with df.wide = bind_rows(df.wide8, df.wide9).
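If there are several group sizes to handle, the per-size blocks can be folded into a single pass by looking up the name vector from the group size. This is only a sketch on toy data: the `layouts` list, the `toy` data frame and the three-column layout are assumptions for illustration, and it still relies on the rows of each group being in a fixed order.

```r
library(dplyr)
library(tidyr)

# Hypothetical layouts: the column names that apply to a group of a given size
layouts <- list(
  "3" = c("Address", "Country", "Status"),
  "2" = c("Country", "Status")   # assumes Address is the row that is missing
)

toy <- data.frame(
  Label = c("A", "A", "A", "B", "B"),
  Info  = c("Main St 1", "Poland", "active", "Germany", "active"),
  stringsAsFactors = FALSE
)

toy.wide <- toy %>%
  group_by(Label) %>%
  # Drop groups whose size has no layout defined
  filter(as.character(n()) %in% names(layouts)) %>%
  # Pick the name vector that matches this group's size
  mutate(NewCols = layouts[[as.character(n())]]) %>%
  ungroup() %>%
  spread(NewCols, Info)

print(toy.wide)
```

Columns that a smaller group lacks (Address for label B here) are filled with NA, which is the same result bind_rows() gives when the separately reshaped pieces are stacked.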

If you provide more information, we may be able to come up with a solution tailored to your actual data.