I have a string column in the following format:
ID Month
1 Jan;Feb
2 Mar;Apr;Jun;Jul;Aug;Nov
3 Jan;May;Oct;Dec
4 Apr;May;Sep;Oct
I want to split this into 12 distinct columns, one per month, for example:
ID | M_1 | M_2 | M_3 | M_4 | M_5 | M_6 | M_7 | M_8 | M_9 | M_10 | M_11 | M_12
1 | Jan | Feb | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA
2 | NA | NA | Mar | Apr | NA | Jun | Jul | Aug | NA | NA | Nov | NA
3 | Jan | NA | NA | NA | May | NA | NA | NA | NA | Oct | NA | Dec
4 | NA | NA | NA | Apr | May | NA | NA | NA | Sep | Oct | NA | NA
I would also be grateful if someone could show me how to do the same thing in Python.
(Apologies for the poor formatting.)
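Answer 0 (score: 4)
Use Series.str.get_dummies, then multiply by the column names so that each cell holds the month name instead of a 1/0 dummy. Rename and reorder the columns, then join the result back to the ID column.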
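import numpy as np
import pandas as pd
# 0/1 indicator columns, one per month that occurs in the data
df1 = df['Month'].str.get_dummies(';')
# multiply by the column names: 1 becomes the month name, 0 becomes '' (then NaN)
df1 = df1.multiply(df1.columns).replace('', np.nan)
# build the month -> column-name mapping (or create the dict manually)
mnths = pd.date_range('2010-01-01', '2010-12-31', freq='MS').strftime('%b')
d = {m: f'M_{i}' for i, m in enumerate(mnths, 1)}
# {'Jan': 'M_1', 'Feb': 'M_2', 'Mar': 'M_3', ... , 'Nov': 'M_11', 'Dec': 'M_12'}
pd.concat([df[['ID']], df1.reindex(mnths, axis=1).rename(columns=d)], axis=1)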
ID M_1 M_2 M_3 M_4 M_5 M_6 M_7 M_8 M_9 M_10 M_11 M_12
0 1 Jan Feb NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 2 NaN NaN Mar Apr NaN Jun Jul Aug NaN NaN Nov NaN
2 3 Jan NaN NaN NaN May NaN NaN NaN NaN Oct NaN Dec
3 4 NaN NaN NaN Apr May NaN NaN NaN Sep Oct NaN NaN
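The multiplication step above appears to rely on Python's rule that an integer times a string repeats the string. The following plain-Python sketch illustrates that rule only; the lists `indicators` and `names` are made-up stand-ins for one row of dummy values and its column names, not part of the answer.

```python
# Plain-Python illustration of int * str, the behaviour the multiply step
# is assumed to rely on. No pandas is involved here.
indicators = [1, 0, 1]          # hypothetical dummy values for one row
names = ["Jan", "Feb", "Mar"]   # the matching column names

# 1 * "Jan" keeps the string, 0 * "Feb" yields the empty string
products = [flag * name for flag, name in zip(indicators, names)]

# Mirror of .replace('', np.nan): turn empty strings into a missing marker
cleaned = [p if p != "" else None for p in products]

print(products)  # ['Jan', '', 'Mar']
print(cleaned)   # ['Jan', None, 'Mar']
```

This is why the answer follows the multiplication with a replace of '' by NaN: absent months come out as empty strings, not as missing values.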
Answer 1 (score: 3)
Here is a Python solution:
Step 1: Pair each month with its column name:
months = ('Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec')
months_column = [f"M_{index}" for index, _ in enumerate(months, 1)]
print(months_column)
['M_1', 'M_2', 'M_3', 'M_4', 'M_5', 'M_6', 'M_7', 'M_8', 'M_9', 'M_10', 'M_11', 'M_12']
#pair the months with the months_column :
mapping = dict(zip(months, months_column))
print(mapping)
{'Jan': 'M_1', 'Feb': 'M_2', 'Mar': 'M_3', 'Apr': 'M_4', 'May': 'M_5', 'Jun': 'M_6',
'Jul': 'M_7', 'Aug': 'M_8', 'Sep': 'M_9', 'Oct': 'M_10', 'Nov': 'M_11', 'Dec': 'M_12'}
Step 2: Move into pandas:
import pandas as pd
df = pd.read_csv(filename)  # filename is the path to your data file
df["Month"] = df.Month.str.split(";")
#expand each word in the Month column to separate rows
df = df.explode("Month")
df["Month_column"] = df.Month.map(mapping)
#final step is to pivot the dataframe to get your result
df.pivot(index="ID", columns="Month_column", values="Month").reindex(columns=months_column)
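One caveat with the pivot step: pivot raises an error when the same ID and column pair occurs more than once, for example if a row lists a month twice. Below is a minimal sketch of the same split, explode, and pivot pipeline that drops such duplicates first and turns ID back into a regular column. The two-row frame with a repeated "Jan" is invented test data, and dropping duplicates is an assumption about what is wanted.

```python
import pandas as pd

months = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
mapping = {m: f"M_{i}" for i, m in enumerate(months, 1)}

# Hypothetical input: ID 1 lists "Jan" twice
df = pd.DataFrame({"ID": [1, 2], "Month": ["Jan;Feb;Jan", "Mar"]})

# One row per (ID, month); identical rows are removed so pivot sees unique pairs
long = (df.assign(Month=df["Month"].str.split(";"))
          .explode("Month")
          .drop_duplicates())
long["Month_column"] = long["Month"].map(mapping)

# Back to wide format, with all 12 columns in calendar order
wide = (long.pivot(index="ID", columns="Month_column", values="Month")
            .reindex(columns=list(mapping.values()))
            .reset_index())
print(wide)
```

Without the drop_duplicates call, the repeated "Jan" for ID 1 would make pivot fail on this input.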
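Answer 2 (score: 3)
.str.split splits the string at each ;
.explode turns the list values into separate rows.
pandas.get_dummies converts categorical variables into dummy/indicator variables.
.groupby and aggregating with .sum combine the indicators for each month by ID.
import pandas as pd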
import calendar # to get month abbreviations in order
month_abr = list(calendar.month_abbr)
# test data
data = {'ID': [1, 2, 3, 4],
'Month': ['Jan;Feb', 'Mar;Apr;Jun;Jul;Aug;Nov', 'Jan;May;Oct;Dec', 'Apr;May;Sep;Oct']}
# setup dataframe
df = pd.DataFrame(data)
# convert month rows to lists
df.Month = df.Month.str.split(';')
# explode lists into rows
df = df.explode('Month')
# create indicators and groupby sum
dfi = pd.get_dummies(df, prefix='', prefix_sep='').groupby('ID', as_index=False).sum()
# optionally reorder the columns
dfi = dfi[dfi.columns.reindex(['ID'] + month_abr[1:])[0]]
# display(dfi)
ID Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
0 1 1 1 0 0 0 0 0 0 0 0 0 0
1 2 0 0 1 1 0 1 1 1 0 0 1 0
2 3 1 0 0 0 1 0 0 0 0 1 0 1
3 4 0 0 0 1 1 0 0 0 1 1 0 0
Alternatively, use pandas.Series.str.get_dummies with the sep parameter.
# test data
data = {'ID': [1, 2, 3, 4],
'Month': ['Jan;Feb', 'Mar;Apr;Jun;Jul;Aug;Nov', 'Jan;May;Oct;Dec', 'Apr;May;Sep;Oct']}
# setup dataframe
df = pd.DataFrame(data)
# get indicators and concat with the IDs
dfi = pd.concat([df.ID, df.Month.str.get_dummies(';')], axis=1)
# optionally reorder the columns
dfi = dfi[dfi.columns.reindex(['ID'] + month_abr[1:])[0]]
# display(dfi)
ID Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
0 1 1 1 0 0 0 0 0 0 0 0 0 0
1 2 0 0 1 1 0 1 1 1 0 0 1 0
2 3 1 0 0 0 1 0 0 0 0 1 0 1
3 4 0 0 0 1 1 0 0 0 1 1 0 0
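Both variants above return 0/1 indicators, not the month names in columns M_1 to M_12 that the question asks for. The following is a minimal sketch of one way to convert the indicators, using NumPy broadcasting; it is not part of this answer. The names `ind`, `values`, `wide`, and `result` are illustrative, and the month list is hard-coded on the assumption that the data uses English abbreviations.

```python
import numpy as np
import pandas as pd

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Month": ["Jan;Feb", "Mar;Apr;Jun;Jul;Aug;Nov",
              "Jan;May;Oct;Dec", "Apr;May;Sep;Oct"],
})

# Indicator matrix with one column per month, forced into calendar order
ind = df["Month"].str.get_dummies(sep=";").reindex(columns=months, fill_value=0)

# Broadcast the month names across the rows: keep the name where the
# indicator is set, use None (missing) everywhere else
values = np.where(ind.to_numpy(dtype=bool), np.array(months, dtype=object), None)

wide = pd.DataFrame(values, index=df.index,
                    columns=[f"M_{i}" for i in range(1, 13)])
result = pd.concat([df[["ID"]], wide], axis=1)
print(result)
```

The reindex with fill_value=0 also covers data in which some month never occurs, since get_dummies only creates columns for values it sees.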
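Answer 3 (score: 2)
We can reshape the data into long format, create a column holding the month index, arrange the rows by month, and then cast back to wide format. With dplyr and tidyr this can be done as follows: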
library(dplyr)
library(tidyr)
df %>%
separate_rows(Month, sep = ";") %>%
mutate(col = paste0('M_', match(Month, month.abb))) %>%
arrange(match(Month, month.abb)) %>%
pivot_wider(names_from = col, values_from = Month) %>%
arrange(ID)
# A tibble: 4 x 13
# ID M_1 M_2 M_3 M_4 M_5 M_6 M_7 M_8 M_9 M_10 M_11 M_12
# <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#1 1 Jan Feb NA NA NA NA NA NA NA NA NA NA
#2 2 NA NA Mar Apr NA Jun Jul Aug NA NA Nov NA
#3 3 Jan NA NA NA May NA NA NA NA Oct NA Dec
#4 4 NA NA NA Apr May NA NA NA Sep Oct NA NA
data
df <- structure(list(ID = 1:4, Month = c("Jan;Feb", "Mar;Apr;Jun;Jul;Aug;Nov",
"Jan;May;Oct;Dec", "Apr;May;Sep;Oct")), class = "data.frame",
row.names = c(NA, -4L))
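Answer 4 (score: 1)
Here is one way to do this in R. I split each string on the semicolon and use a named vector of months whose values all start as NA. I loop over the list of split strings and, for each month found, replace the NA under that name with the month string.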
df <- data.frame(
ID = c(1L, 2L, 3L, 4L),
Month = c("Jan;Feb","Mar;Apr;Jun;Jul;Aug;Nov",
"Jan;May;Oct;Dec","Apr;May;Sep;Oct")
)
parse_dataframe <- function(string, splt = ';') {
  splt_string <- strsplit(string, splt)
  output <- list()
  count <- 1
  for (i in splt_string) {
    key <- rep(NA, 12)
    names(key) <- c('Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec')
    for (month in i) {
      key[month] <- month
    }
    output[[count]] <- key
    count <- count + 1
  }
  output <- do.call(rbind, output)
  colnames(output) <- paste0('M_', seq(1, 12))
  return(output)
}
parse_dataframe(df$Month)
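The Python version is similar, where the data is df and the column is "Month". I used pandas because the standard library has no native way to represent NA values.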
import pandas as pd
import calendar
def parse_dataframe(string_series, splt=';'):
    str_splt = string_series.str.split(splt)
    months = [calendar.month_abbr[x] for x in range(1, 13)]
    output = []
    for row in str_splt:
        output_row = [pd.NA for _ in range(12)]
        for month in row:
            output_row[months.index(month)] = month
        output.append(output_row)
    df = pd.DataFrame(output)
    df.columns = ['M_' + str(x) for x in range(1, 13)]
    return df
parse_dataframe(df['Month'])
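Note that months.index(month) raises a ValueError for any token that is not an exact month abbreviation, for instance one with stray whitespace, and that the function returns the month columns without the ID column. Below is a hedged variant that looks positions up in a dict, strips whitespace, and skips unknown tokens. The name `months_to_columns` and the choice to skip silently are assumptions for illustration, not the answer's behaviour.

```python
import pandas as pd

MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
POSITION = {name: i for i, name in enumerate(MONTHS)}  # 'Jan' -> 0, ...


def months_to_columns(series, sep=";"):
    """Return a 12-column frame (M_1..M_12) built from delimited month strings."""
    rows = []
    for cell in series:
        row = [None] * len(MONTHS)
        for token in str(cell).split(sep):
            name = token.strip()
            if name in POSITION:  # unknown tokens are skipped (an assumption)
                row[POSITION[name]] = name
        rows.append(row)
    columns = [f"M_{i}" for i in range(1, len(MONTHS) + 1)]
    return pd.DataFrame(rows, index=series.index, columns=columns)


# Hypothetical input with stray whitespace and one unrecognised token
df = pd.DataFrame({"ID": [1, 2], "Month": ["Jan; Feb", "Foo;Dec"]})
out = pd.concat([df[["ID"]], months_to_columns(df["Month"])], axis=1)
print(out)
```

Reusing the input's index in the returned frame keeps the concat with the ID column aligned even when the index is not 0..n-1.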
Answer 5 (score: 1)
Here is a more compact base R solution:
df <- data.frame(
ID = c(1:4),
Month = c("Jan;Feb","Mar;Apr;Jun;Jul;Aug;Nov",
"Jan;May;Oct;Dec","Apr;May;Sep;Oct")
)
Months <- factor(month.abb, month.abb)
dfm <- do.call(rbind, lapply(strsplit(df$Month, ";"),
function(x) x[Months][match(Months, x)]))
setNames(data.frame(df$ID, dfm), c("ID", paste0("M_", seq_along(Months))))
#> ID M_1 M_2 M_3 M_4 M_5 M_6 M_7 M_8 M_9 M_10 M_11 M_12
#> 1 1 Jan Feb <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#> 2 2 <NA> <NA> Mar Apr <NA> Jun Jul Aug <NA> <NA> Nov <NA>
#> 3 3 Jan <NA> <NA> <NA> May <NA> <NA> <NA> <NA> Oct <NA> Dec
#> 4 4 <NA> <NA> <NA> Apr May <NA> <NA> <NA> Sep Oct <NA> <NA>
Created on 2020-08-08 by the reprex package (v0.3.0)