R / Python - Split a string column into multiple distinct columns

Time: 2020-08-08 22:28:15

Tags: r python-3.x pandas dataframe

I have a string column in the following format:
ID  Month 
1   Jan;Feb 
2   Mar;Apr;Jun;Jul;Aug;Nov 
3   Jan;May;Oct;Dec 
4   Apr;May;Sep;Oct 

I want to create 12 distinct columns, one per month, for example:

ID | M_1  | M_2 | M_3 | M_4 | M_5 | M_6 | M_7 | M_8 | M_9 | M_10 | M_11 | M_12 
1  | Jan  | Feb | NA  | NA  | NA  | NA  | NA  | NA  | NA  | NA   | NA   | NA 
2  | NA   | NA  | Mar | Apr | NA  | Jun | Jul | Aug | NA  | NA   | Nov  | NA 
3  | Jan  | NA  | NA  | NA  | May | NA  | NA  | NA  | NA  | Oct  | NA   | Dec 
4  | NA   | NA  | NA  | Apr | May | NA  | NA  | NA  | Sep | Oct  | NA   | NA 

I would also appreciate it if someone could show me how to do the same thing in Python.

(Apologies for the poor formatting.)

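6 answers:

Answer 0 (score: 4)

Use Series.str.get_dummies, then multiply by the column names so the cells hold the month names instead of 1/0 dummy values. Rename and reorder the columns, then join the result back to the ID column.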

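import numpy as np
import pandas as pd

# One 0/1 indicator column per month found in the data
df1 = df['Month'].str.get_dummies(';')
# Multiply by the column names so each 1 becomes the month name, then turn '' into NaN
df1 = df1.multiply(df1.columns).replace('', np.nan)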

# Or create the dict manually
mnths = pd.date_range('2010-01-01', '2010-12-31', freq='MS').strftime('%b')
d = {m: f'M_{i}' for m,i in zip(mnths, range(1, len(mnths)+1))}
#{'Jan': 'M_1', 'Feb': 'M_2', 'Mar': 'M_3', ... , 'Nov': 'M_11', 'Dec': 'M_12'}    

pd.concat([df[['ID']], df1.reindex(mnths, axis=1).rename(columns=d)], axis=1)

   ID  M_1  M_2  M_3  M_4  M_5  M_6  M_7  M_8  M_9 M_10 M_11 M_12
0   1  Jan  Feb  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
1   2  NaN  NaN  Mar  Apr  NaN  Jun  Jul  Aug  NaN  NaN  Nov  NaN
2   3  Jan  NaN  NaN  NaN  May  NaN  NaN  NaN  NaN  Oct  NaN  Dec
3   4  NaN  NaN  NaN  Apr  May  NaN  NaN  NaN  Sep  Oct  NaN  NaN
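
For readers who want to run this approach end to end, here is a minimal self-contained sketch. It assumes the sample data from the question and a hard-coded English month list (`MONTHS` is a name introduced here), and it swaps the multiplication step for `Series.where`, so it is a variation on the answer rather than the answerer's exact code.

```python
import pandas as pd

# Assumption: English month abbreviations, hard-coded to avoid locale effects
MONTHS = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

# Sample data copied from the question
df = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'Month': ['Jan;Feb', 'Mar;Apr;Jun;Jul;Aug;Nov',
              'Jan;May;Oct;Dec', 'Apr;May;Sep;Oct'],
})

# 0/1 indicators, forced into calendar order; months absent from the data become 0
dummies = df['Month'].str.get_dummies(';').reindex(columns=MONTHS, fill_value=0)

# For each month, keep the month name where the indicator is 1 and NaN elsewhere
named = pd.DataFrame({
    f'M_{i}': pd.Series(month, index=dummies.index).where(dummies[month].eq(1))
    for i, month in enumerate(MONTHS, 1)
})

result = pd.concat([df[['ID']], named], axis=1)
print(result)
```

Building each column with `where` avoids relying on integer-times-string multiplication, at the cost of a slightly longer expression.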

Answer 1 (score: 3)

Here is a Python solution:

Step 1: Pair each month with its column name:

months = ('Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec')

months_column = [f"M_{index}" for index, _ in enumerate(months, 1)]

print(months_column)
['M_1', 'M_2', 'M_3', 'M_4', 'M_5', 'M_6', 'M_7', 'M_8', 'M_9', 'M_10', 'M_11', 'M_12']

#pair the months with the months_column : 
mapping = dict(zip(months, months_column))
print(mapping)
{'Jan': 'M_1', 'Feb': 'M_2', 'Mar': 'M_3', 'Apr': 'M_4', 'May': 'M_5', 'Jun': 'M_6',
 'Jul': 'M_7', 'Aug': 'M_8', 'Sep': 'M_9', 'Oct': 'M_10', 'Nov': 'M_11', 'Dec': 'M_12'}

Step 2: Move into pandas:

import pandas as pd

df = pd.read_csv(filename)  # filename: path to your CSV file

df["Month"] = df.Month.str.split(";")
# expand each month in the Month column into its own row
df = df.explode("Month")
df["Month_column"] = df.Month.map(mapping)
# final step is to pivot the dataframe to get your result
df.pivot(index="ID", columns="Month_column", values="Month").reindex(columns=months_column)
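
The two steps can also be sketched as one runnable script. This version assumes the sample data from the question is built inline instead of being read from a CSV file, and the names `long_df` and `wide` are introduced here for illustration.

```python
import pandas as pd

months = ('Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec')
# Map each month to its target column name: 'Jan' -> 'M_1', ..., 'Dec' -> 'M_12'
mapping = {month: f"M_{i}" for i, month in enumerate(months, 1)}

# Sample data copied from the question (stands in for pd.read_csv)
df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Month": ["Jan;Feb", "Mar;Apr;Jun;Jul;Aug;Nov",
              "Jan;May;Oct;Dec", "Apr;May;Sep;Oct"],
})

# Long format: one row per (ID, month) pair
long_df = df.assign(Month=df["Month"].str.split(";")).explode("Month")
long_df["Month_column"] = long_df["Month"].map(mapping)

# Wide format: one column per month, in calendar order, with ID as a regular column
wide = (
    long_df.pivot(index="ID", columns="Month_column", values="Month")
    .reindex(columns=list(mapping.values()))
    .reset_index()
)
print(wide)
```

The final `reindex` matters because `pivot` sorts the new column labels as strings, which would otherwise place `M_10` before `M_2`.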

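Answer 2 (score: 3)

  • You have a specified output, but I offer this as an alternative.
  • Use .str.split to split the string at each ;
  • Use .explode to turn the list values into separate rows
  • Use pandas.get_dummies to convert the categorical variable into dummy/indicator variables.
  • Use .groupby and aggregate with .sum to combine the monthly indicators for each ID

import pandas as pd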
import calendar  # to get month abbreviations in order

month_abr = list(calendar.month_abbr)

# test data
data = {'ID': [1, 2, 3, 4],
        'Month': ['Jan;Feb', 'Mar;Apr;Jun;Jul;Aug;Nov', 'Jan;May;Oct;Dec', 'Apr;May;Sep;Oct']}

# setup dataframe
df = pd.DataFrame(data)

# convert month rows to lists
df.Month = df.Month.str.split(';')

# explode lists into rows
df = df.explode('Month')

# create indicators and groupby sum
dfi = pd.get_dummies(df, prefix='', prefix_sep='').groupby('ID', as_index=False).sum()

# optionally reorder the columns
dfi = dfi[dfi.columns.reindex(['ID'] + month_abr[1:])[0]]

# display(dfi)
   ID  Jan  Feb  Mar  Apr  May  Jun  Jul  Aug  Sep  Oct  Nov  Dec
0   1    1    1    0    0    0    0    0    0    0    0    0    0
1   2    0    0    1    1    0    1    1    1    0    0    1    0
2   3    1    0    0    0    1    0    0    0    0    1    0    1
3   4    0    0    0    1    1    0    0    0    1    1    0    0

More concisely:

# test data
data = {'ID': [1, 2, 3, 4],
        'Month': ['Jan;Feb', 'Mar;Apr;Jun;Jul;Aug;Nov', 'Jan;May;Oct;Dec', 'Apr;May;Sep;Oct']}

# setup dataframe
df = pd.DataFrame(data)

# get indicators and concat with the IDs
dfi = pd.concat([df.ID, df.Month.str.get_dummies(';')], axis=1)

# optionally reorder the columns
dfi = dfi[dfi.columns.reindex(['ID'] + month_abr[1:])[0]]

# display(dfi)
   ID  Jan  Feb  Mar  Apr  May  Jun  Jul  Aug  Sep  Oct  Nov  Dec
0   1    1    1    0    0    0    0    0    0    0    0    0    0
1   2    0    0    1    1    0    1    1    1    0    0    1    0
2   3    1    0    0    0    1    0    0    0    0    1    0    1
3   4    0    0    0    1    1    0    0    0    1    1    0    0
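
The column-reordering step above appears to rely on every month occurring at least once in the data. As a hedged sketch for inputs where some months never occur, `DataFrame.reindex` with `fill_value=0` adds the missing indicator columns. The two-row sample data and the hard-coded `MONTHS` list are assumptions made for illustration.

```python
import pandas as pd

# Assumption: English month abbreviations, hard-coded to avoid locale effects
MONTHS = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

# Hypothetical data in which most months never appear
df = pd.DataFrame({'ID': [1, 2], 'Month': ['Jan;Feb', 'Mar;Apr']})

# Indicators for the months present, then padded out to all 12 columns with zeros
indicators = df['Month'].str.get_dummies(';').reindex(columns=MONTHS, fill_value=0)

dfi = pd.concat([df['ID'], indicators], axis=1)
print(dfi)
```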

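Answer 3 (score: 2)

We can get the data into long format, create a column holding the month index, arrange the data, and return to wide format. Using dplyr and tidyr, this can be done as follows: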

library(dplyr)  
library(tidyr)

df %>%
    separate_rows(Month, sep = ";") %>%
    mutate(col = paste0('M_', match(Month, month.abb))) %>%
    arrange(match(Month, month.abb)) %>%
    pivot_wider(names_from = col, values_from = Month) %>%
    arrange(ID)

# A tibble: 4 x 13
#     ID M_1   M_2   M_3   M_4   M_5   M_6   M_7   M_8   M_9   M_10  M_11  M_12 
#  <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#1     1 Jan   Feb   NA    NA    NA    NA    NA    NA    NA    NA    NA    NA   
#2     2 NA    NA    Mar   Apr   NA    Jun   Jul   Aug   NA    NA    Nov   NA   
#3     3 Jan   NA    NA    NA    May   NA    NA    NA    NA    Oct   NA    Dec  
#4     4 NA    NA    NA    Apr   May   NA    NA    NA    Sep   Oct   NA    NA   

Data

df <- structure(list(ID = 1:4, Month = c("Jan;Feb", "Mar;Apr;Jun;Jul;Aug;Nov", 
"Jan;May;Oct;Dec", "Apr;May;Sep;Oct")), class = "data.frame", 
row.names = c(NA, -4L))

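Answer 4 (score: 1)

Here is one way to do this in R. I split the strings on the semicolon and used a named vector of months in which every element starts as NA. I loop over the list of split strings to get each value, then replace the NA with the string value, matching by name.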

df <- data.frame(
            ID = c(1L, 2L, 3L, 4L),
         Month = c("Jan;Feb","Mar;Apr;Jun;Jul;Aug;Nov",
                   "Jan;May;Oct;Dec","Apr;May;Sep;Oct")
  )
  
  
  parse_dataframe <- function(string, splt=';'){
    splt_string <- strsplit(string, splt)
    output <- list()
    count = 1
    for (i in splt_string){
      key <- rep(NA, 12)
      names(key) <- c('Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec')
      for (month in i){
        key[month] <- month
      }
      output[[count]] <- key
      count <- count + 1
    }
    output <- do.call(rbind, output)
    colnames(output) <- paste0('M_', seq(1,12))
    return(output)
  }
  
  parse_dataframe(df$Month)

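The Python version is similar, where the data is df and the column is "Month". I used pandas because there isn't really a native way to get NA values with the standard library alone.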

import pandas as pd
import calendar 

def parse_dataframe(string_series, splt=';'):
    str_splt = string_series.str.split(splt)
    months = [calendar.month_abbr[x] for x in range(1, 13)]
    output = []

    for row in str_splt:
        output_row = [pd.NA for _ in range(12)]
        for month in row:
            output_row[months.index(month)] = month

        output.append(output_row)
    df = pd.DataFrame(output)
    df.columns = ['M_' + str(x) for x in range(1, 13)]

    return df

parse_dataframe(df['Month'])
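
If `None` is an acceptable stand-in for a missing value, the same row-building logic can be sketched with only the standard library. The function name `parse_rows` and the hard-coded `MONTHS` list are introduced here for illustration and are not part of the answer above.

```python
# Assumption: English month abbreviations in calendar order
MONTHS = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']


def parse_rows(strings, sep=';'):
    """Return one 12-element list per input string, with None for absent months."""
    rows = []
    for text in strings:
        present = set(text.split(sep))
        # Walk the calendar in order so position i always corresponds to column M_(i+1)
        rows.append([month if month in present else None for month in MONTHS])
    return rows


rows = parse_rows(['Jan;Feb', 'Apr;May;Sep;Oct'])
print(rows[0])
```

The resulting list of lists can be passed to `pd.DataFrame` later if a DataFrame is needed after all.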

Answer 5 (score: 1)

Here is a more compact base R solution:

df <- data.frame(
  ID = c(1:4),
  Month = c("Jan;Feb","Mar;Apr;Jun;Jul;Aug;Nov",
            "Jan;May;Oct;Dec","Apr;May;Sep;Oct")
)

Months <- factor(month.abb, month.abb)
dfm <- do.call(rbind, lapply(strsplit(df$Month, ";"), 
  function(x) x[Months][match(Months, x)]))
setNames(data.frame(df$ID, dfm), c("ID", paste0("M_", seq_along(Months))))
#>   ID  M_1  M_2  M_3  M_4  M_5  M_6  M_7  M_8  M_9 M_10 M_11 M_12
#> 1  1  Jan  Feb <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#> 2  2 <NA> <NA>  Mar  Apr <NA>  Jun  Jul  Aug <NA> <NA>  Nov <NA>
#> 3  3  Jan <NA> <NA> <NA>  May <NA> <NA> <NA> <NA>  Oct <NA>  Dec
#> 4  4 <NA> <NA> <NA>  Apr  May <NA> <NA> <NA>  Sep  Oct <NA> <NA>

Created on 2020-08-08 by the reprex package (v0.3.0)