My data frame df looks like this:
   Label          Info
1  0-22 Records   N/A
2  0-22 Records   Poland
3  0-22 Records   N/A
4  0-22 Records   active
5  0-22 Records   Hardcore
6  0-22 Records   N/A
7  0-22 Records   N/A
8  Nuclear Blast  "Oeschstr. 40 73072 Donzdorf"
9  Nuclear Blast  Germany
10 Nuclear Blast  +49 7162 9280-0
11 Nuclear Blast  active
12 Nuclear Blast  Hardcore (early), Metal and subgenres
13 Nuclear Blast  1987
14 Nuclear Blast  "Anstalt Records, Arctic Serenades, Cannibalised Serial Killer, Deathwish Office, Epica, Gore Records, Grind Syndicate Media, Mind Control Records, Nuclear Blast America, Nuclear Blast Brasil, Nuclear Blast Entertainment, Radiation Records, Revolution Entertainment"
15 Nuclear Blast  Yes
I would like to reshape df into something that looks like this:
Label Address Country Phone Status Genre Year Sub Online
1 0-22 Records N/A Poland N/A active Hardcore N/A N/A N/A
2 Nuclear Blast "Oes.." Germany +49...
.
.
The number of rows per label varies from 7 to 9. I have tried reshape and reshape2 with "Label" assigned as the key, to no avail.
Edit: dput output:
structure(list(label = c("0-22 Records", "0-22 Records", "0-22 Records",
"0-22 Records", "0-22 Records", "0-22 Records", "0-22 Records",
"Nuclear Blast", "Nuclear Blast", "Nuclear Blast", "Nuclear Blast",
"Nuclear Blast", "Nuclear Blast", "Nuclear Blast", "Nuclear Blast",
"Metal Blade Records", "Metal Blade Records", "Metal Blade Records",
"Metal Blade Records", "Metal Blade Records"), info = c(" N/A ",
"Poland", " N/A ", "active", " Hardcore ", " N/A ", "N/A", " Oeschstr.
40\r\n73072 Donzdorf ",
"Germany", " +49 7162 9280-0 ", "active", " Hardcore (early), Metal and
subgenres ", " 1987 ", "\n\t\t\t\t\t\t\t\t\tAnstalt
Records,\t\t\t\t\t\t\t\t\tArctic Serenades,\t\t\t\t\t\t\t\t\tCannibalised
Serial Killer,\t\t\t\t\t\t\t\t\tDeathwish
Office,\t\t\t\t\t\t\t\t\tEpica,\t\t\t\t\t\t\t\t\tGore
Records,\t\t\t\t\t\t\t\t\tGrind Syndicate Media,\t\t\t\t\t\t\t\t\tMind
Control Records,\t\t\t\t\t\t\t\t\tNuclear Blast
America,\t\t\t\t\t\t\t\t\tNuclear Blast Brasil,\t\t\t\t\t\t\t\t\tNuclear
Blast Entertainment,\t\t\t\t\t\t\t\t\tRadiation
Records,\t\t\t\t\t\t\t\t\tRevolution Entertainment\t\t\t\t\t ",
"Yes", " 5737 Kanan Road #143\r\nAgoura Hills, California 91301 ",
"United States", " N/A ", "active", " Heavy Metal, Extreme Metal "
)), .Names = c("label", "info"), class = c("data.table", "data.frame"
), row.names = c(NA, -20L), .internal.selfref = <pointer: 0x10200db78>)
Answer 0 (score: 1)
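The new column names for the wide data frame (Address, Country, and so on) do not appear anywhere in df. We need to add a column to df that maps each value of info to the correct column name of the wide data frame, so that the data in a given row ends up in the right column after reshaping.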
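To make the idea concrete, here is a minimal sketch on a small, made-up long table in which the key column (called NewCols here, as in the code further down) has already been filled in. It uses pivot_wider() from tidyr 1.0.0 or later, which does the same job as spread(); the sample rows are hypothetical and only modeled on the question's data.

```r
library(dplyr)
library(tidyr)

# Hypothetical long-format sample with the key column already assigned
long <- tibble(
  label   = c("0-22 Records", "0-22 Records",
              "Nuclear Blast", "Nuclear Blast", "Nuclear Blast"),
  NewCols = c("Country", "Status", "Country", "Status", "Year"),
  info    = c("Poland", "active", "Germany", "active", "1987")
)

# One row per label, one column per distinct value of NewCols;
# combinations that never occur (Year for 0-22 Records) become NA
wide <- long %>%
  pivot_wider(names_from = NewCols, values_from = info)

wide
```

Note that each label/key pair should occur at most once. If two rows of the same label are mapped to the same key, spread() stops with an error, so duplicates are a sign that the mapping needs refining.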
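The challenge is that we need to find ways to exploit regularities in the data to decide which values of info represent Genre, Country, Year, and so on. Based on the data sample you provided, here are some initial ideas. In the code below, the case_when statement attempts to map info to the new column names. In order, the conditions inside case_when try to identify the following: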
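1. Country: the value matches one of the English country names in the countrycode lookup table.
2. Status (assuming it can only be "active" or "inactive").
3. Genre. Here you will need to cover more possibilities.
4. Year. I assumed that any row consisting of a four-digit number in the range 1950-2017 represents a year. Adjust as needed.
5. Phone. I assumed it always starts with +, so you may need something more sophisticated.
6. Online (assuming it can only be "Yes" or "No", and that rows belonging to a different column never consist solely of the word "Yes" or "No").
7. Sub. You will probably need a more sophisticated strategy. For now, I assume that rows containing "Records" or "Entertainment", or rows with three or more commas, are Sub rows. You will need to play around with these rules and see what works in the context of your data.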
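Because case_when() returns the result of the first condition that is true, the order of these rules matters. The sketch below illustrates that on a handful of already-trimmed values modeled on the sample data. It leaves out the Country lookup so that it runs without the countrycode package, and the vector names info and new_cols are only for illustration.

```r
library(dplyr)

# Toy values modeled on the question's data (already trimmed)
info <- c("active", "Hardcore (early), Metal and subgenres", "1987",
          "+49 7162 9280-0", "Yes", "Gore Records, Epica", "Oeschstr. 40")

# Conditions are evaluated top to bottom; the first TRUE one decides the label
new_cols <- case_when(
  grepl("active", info)                             ~ "Status",
  grepl("hardcore|metal", info, ignore.case = TRUE) ~ "Genre",
  info %in% as.character(1950:2017)                 ~ "Year",
  grepl("^\\+", info)                               ~ "Phone",
  grepl("^(Yes|No)$", info)                         ~ "Online",
  grepl("Records|Entertainment", info)              ~ "Sub",
  TRUE                                              ~ "Address"  # fallback
)

data.frame(info, new_cols)
```

One consequence of first-match-wins: a Sub value that happens to contain the word "Metal" would be labeled Genre, because the Genre rule is tested before the Sub rule. Inspecting the rows that fall through to the Address fallback is a quick way to spot rules that are too narrow.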
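library(tidyverse)    # attaches dplyr, tidyr and stringr
library(countrycode)
# Lookup table of country names. In recent countrycode versions this table
# is named codelist instead; if data() fails here, try that name.
data("countrycode_data")

df %>%
  # Drop placeholder rows
  filter(!grepl("N/A", info)) %>%
  mutate(
    # Strip carriage returns, tabs, newlines and runs of spaces, then trim
    info = str_trim(gsub("\r*\t*|\n*| {2,}", "", info)),
    # Conditions are tested in order; the first one that matches wins
    NewCols = case_when(
      sapply(info, function(x) any(grepl(x, countrycode_data$country.name.en))) ~ "Country",
      grepl("active", info) ~ "Status",
      grepl("hardcore|metal|rock|classical", info, ignore.case = TRUE) ~ "Genre",
      info %in% 1950:2017 ~ "Year",
      grepl("^\\+", info) ~ "Phone",
      grepl("^Yes$|^No$", info) ~ "Online",
      # "Records", "Entertainment", or three or more commas anywhere in the value
      grepl("Records|Entertainment", info) | str_count(info, ",") >= 3 ~ "Sub",
      TRUE ~ "Address"   # everything else is assumed to be an address
    )
  ) %>%
  group_by(label) %>%
  spread(NewCols, info)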
Here is the output (I truncated the long value of Sub to save space):
label Address Country Genre Online Phone Status Sub Year
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 0-22 Records <NA> Poland Hardcore <NA> <NA> active NA <NA>
2 Metal Blade Records 5737 Kanan Road #143Agoura Hills, California 91301 United States Heavy Metal, Extreme Metal <NA> <NA> active NA <NA>
3 Nuclear Blast Oeschstr. 4073072 Donzdorf Germany Hardcore (early), Metal and subgenres Yes +49 7162 9280-0 active Anstalt Re... 1987
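In the output above, the two lines of each address are fused together ("4073072", "#143Agoura") because the line breaks are deleted rather than replaced. If that matters, one possible alternative for the cleaning step is str_squish() from stringr (version 1.3.0 or later), which trims both ends and collapses every run of whitespace, including tabs and line breaks, into a single space. A small sketch on shortened values from the dput (the vector names raw and clean are illustrative):

```r
library(stringr)

# Shortened versions of two raw values from the dput output
raw <- c(" Oeschstr. 40\r\n73072 Donzdorf ",
         "\n\t\t\tAnstalt Records,\t\t\tArctic Serenades\t\t ")

# Trim the ends and collapse internal whitespace runs to one space
clean <- str_squish(raw)

clean
```

With this, the address keeps a space between the street line and the postal code, and the sub-label list reads as an ordinary comma-separated string.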
Original answer (before the data sample was available)
If every Label has eight rows (one per target column), and the type of data in each row is always in the same order for every Label, then one solution would be:
library(tidyverse)
df.wide = df %>%
  group_by(Label) %>%
  # mutate() runs once per group, so the eight names line up with the eight rows
  mutate(NewCols = c("Address","Country","Phone","Status","Genre","Year","Sub","Online")) %>%
  spread(NewCols, Info)
In your real data, you can apply this to any level of Label that has eight rows:
df.wide8 = df %>%
  group_by(Label) %>%
  filter(n() == 8) %>%
  mutate(NewCols = c("Address","Country","Phone","Status","Genre","Year","Sub","Online")) %>%
  spread(NewCols, Info)
For levels of Label with fewer rows, if the missing row always represents the same type of data (for example, say the Address row is always the one missing from the 7-row levels of Label), then you can do the following (again assuming the data types appear in the same order for every Label):
df.wide7 = df %>%
  group_by(Label) %>%
  filter(n() == 7) %>%
  # Same names as above, minus Address
  mutate(NewCols = c("Country","Phone","Status","Genre","Year","Sub","Online")) %>%
  spread(NewCols, Info)
Then you can put them together with df.wide = bind_rows(df.wide7, df.wide8).
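Before relying on row position, it is worth checking how many rows each Label actually has, since the positional approach only works when every group of a given size follows the same layout. A minimal sketch, using a made-up df with the question's column names (the names row_counts, n_rows and n_labels are illustrative):

```r
library(dplyr)

# Hypothetical sample: one label with 7 rows, one with 8
df <- tibble(
  Label = rep(c("0-22 Records", "Nuclear Blast"), times = c(7, 8)),
  Info  = as.character(1:15)
)

# Rows per label
row_counts <- df %>% count(Label, name = "n_rows")

# How many labels have each row count
row_counts %>% count(n_rows, name = "n_labels")
```

Any row count that shows up here without a matching name vector needs its own rule, or the content-based mapping described at the top of this answer.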
If you provide more information, we may be able to offer a solution tailored to your actual data.