Extract strings iteratively and assign them to columns

Time: 2018-06-27 10:34:19

Tags: r regex string extract

My dataset has a single, untidy column, Column1, which contains many rows; each record starts with a date. For example:

Column1
date: 28-Oct-2017
company: BB KISS
classification: Software
roundsize: 1.2
cumulative: 1.2
round: Seed
investors: Private
headquartered: Darmstadt
country: Germany
region: DACH
description: Software development for crypto currency and blockchain 
url: https://bbkiss.de/

To extract the text after the ":", I use:

df$extract <- sub('.*:', '', df$Column1)
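One caveat with this pattern, shown as a small sketch: `.*:` is greedy, so it matches up to the last colon in the string. That truncates values that contain a colon themselves, such as the URL row, and it also leaves the leading space in place. An anchored pattern that removes only the key, the first colon, and any following whitespace avoids both problems. The vector `x` below is illustrative sample data taken from the example rows.

```r
x <- c("date: 28-Oct-2017", "url: https://bbkiss.de/")

# Greedy: '.*:' consumes everything up to the LAST colon,
# so the URL loses its "https" prefix and the date keeps a leading space.
greedy <- sub('.*:', '', x)

# Anchored: drop only the leading key, the first colon,
# and any whitespace that follows it.
anchored <- sub('^[^:]*:[[:space:]]*', '', x)

print(greedy)
print(anchored)
```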

I want to assign date, company, classification, and the remaining fields to new columns, like this:

date          company  classification  roundsize  cumulative  round ...
28-Oct-2017   BB KISS  Software        1.2        1.2         Seed  ...

How can I do this?

2 Answers:

Answer 0: (score: 1)

You can split the column with separate() and then reshape it with spread(), both from {tidyr}:

tab <- tibble::tribble(
  ~ column1, 
  "date: 28-Oct-2017",
  "company: BB KISS",
  "classification: Software",
  "roundsize: 1.2",
  "cumulative: 1.2"
)
library(tidyr)
tab %>% 
  separate(column1, into = c("A", "B"), sep = ": ") %>%
  spread(key = A, value = B)
#> # A tibble: 1 x 5
#>   classification company cumulative date        roundsize
#>   <chr>          <chr>   <chr>      <chr>       <chr>    
#> 1 Software       BB KISS 1.2        28-Oct-2017 1.2
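As of tidyr 1.0.0, spread() is superseded by pivot_wider(), so the same idea can be sketched as follows, assuming a tidyr version that provides pivot_wider(). The `extra = "merge"` argument makes separate() split only at the first ": ", which matters if a value contains ": " itself. The `note` row is a hypothetical example added to show that case; it is not part of the original data.

```r
library(tidyr)

tab <- tibble::tibble(
  column1 = c(
    "date: 28-Oct-2017",
    "company: BB KISS",
    "classification: Software",
    "note: ratio: 1.2"  # hypothetical row whose value contains ": "
  )
)

wide <- tab %>%
  # extra = "merge" splits at most once, keeping the rest of the value intact
  separate(column1, into = c("key", "value"), sep = ": ", extra = "merge") %>%
  # one column per key; columns keep their order of first appearance
  pivot_wider(names_from = key, values_from = value)

print(wide)
```

Unlike spread(), which sorts the new columns alphabetically, pivot_wider() keeps them in the order they first appear, which matches the layout requested in the question.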

Answer 1: (score: 0)

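I created an example dataset consisting of 2 (identical) companies. You can get everything working with tidyr and dplyr. You need to create an ID so that spread() can tell the records apart; without it, spread() fails because the same key occurs in more than one row.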

library(tidyr)
library(dplyr)

df_new <- df1 %>% 
  separate(Column1, into = c("cols", "data"), sep = ": ") %>% 
  group_by(cols) %>%
  mutate(id = row_number()) %>% # create id per company
  spread(cols, data)

df_new
# A tibble: 2 x 13
     id classification company country cumulative date        description           headquartered investors region round roundsize url    
  <int> <chr>          <chr>   <chr>   <chr>      <chr>       <chr>                 <chr>         <chr>     <chr>  <chr> <chr>     <chr>  
1     1 Software       BB KISS Germany 1.2        28-Oct-2017 "Software developmen~ Darmstadt     Private   DACH   Seed  1.2       https:~
2     2 Software       BB KISS Germany 1.2        28-Oct-2017 "Software developmen~ Darmstadt     Private   DACH   Seed  1.2       https:~
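Numbering rows within each key works when every record has the same set of fields. If a field can be missing from some records, the values may end up attached to the wrong record. Since the question states that each record starts with a date row, an alternative sketch is to start a new ID whenever a "date" key appears. The data frame `df_demo` below is made-up sample data in which the first record deliberately lacks a "round" row.

```r
library(dplyr)
library(tidyr)

# Illustrative data: the first record has no "round" row
df_demo <- data.frame(
  Column1 = c("date: 28-Oct-2017", "company: BB KISS",
              "date: 28-Oct-2017", "company: BB KISS", "round: Seed"),
  stringsAsFactors = FALSE
)

df_wide <- df_demo %>%
  separate(Column1, into = c("cols", "data"), sep = ": ", extra = "merge") %>%
  # a new record begins at every "date" row (assumption taken from the question)
  mutate(id = cumsum(cols == "date")) %>%
  spread(cols, data)

print(df_wide)
```

Here the "round" value stays with the second record and the first record gets NA, whereas a per-key row_number() would have assigned it to id 1.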

Data:

df1 <- structure(list(Column1 = c("date: 28-Oct-2017", "company: BB KISS", 
"classification: Software", "roundsize: 1.2", "cumulative: 1.2", 
"round: Seed", "investors: Private", "headquartered: Darmstadt", 
"country: Germany", "region: DACH", "description: Software development for crypto currency and blockchain ", 
"url: https://bbkiss.de/", "date: 28-Oct-2017", "company: BB KISS", 
"classification: Software", "roundsize: 1.2", "cumulative: 1.2", 
"round: Seed", "investors: Private", "headquartered: Darmstadt", 
"country: Germany", "region: DACH", "description: Software development for crypto currency and blockchain ", 
"url: https://bbkiss.de/")), class = "data.frame", row.names = c(NA, 
-24L))