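Extract part of a string from one column and paste it into a new column

Time: 2019-06-11 12:40:31

Tags: r string dataframe split

Disclaimer: I have no experience with R whatsoever, so please bear with me!...

Context: I have a directory containing a series of .csv files. Each file has 7 columns and roughly 100 rows. I've pieced together a script that reads in all the files and loops over each one, adding some new columns based on different conditions (for example, if a particular column mentions "box set", it creates a new column called "box_set" with "yes" or "no" for each row), and then overwrites the original file. The one thing I can't figure out (and yes, I have Googled high and low) is how to split a column in two based on a particular string. The string always starts with ": Series" but can end with different numbers or a range of numbers, e.g. "Poldark: Series 4", "The Musketeers: Series 1-3".

I'd like to be able to split that column (currently called Programme_Title) into two columns (one called Programme_Title and one called Series_Details). Programme_Title would contain everything before the ":", while Series_Details would contain everything from the "S" onwards.

To complicate matters, the Programme_Title column contains many different strings, and not all of them follow the examples above. Some don't contain ": Series" at all, and some contain ":" that isn't followed by "Series".

Because I'm terrible at explaining these things, here is an example of what it currently looks like: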

Programme_Title               
Hidden
Train Surfing Wars: A Matter of Life and Death
Bollywood: The World's Biggest Film Industry
Cuckoo: Series 4
Mark Gatiss on John Minton: The Lost Man of British Art
Love and Drugs on the Street
Asian Provocateur: Series 1-2
Poldark: Series 4
The Musketeers: Series 1-3
War and Peace

And this is what I'd like it to look like:

Programme_Title                                          Series_Details
Hidden
Train Surfing Wars: A Matter of Life and Death
Bollywood: The World's Biggest Film Industry
Cuckoo                                                   Series 4
Mark Gatiss on John Minton: The Lost Man of British Art
Love and Drugs on the Street
Asian Provocateur                                        Series 1-2
Poldark                                                  Series 4
The Musketeers                                           Series 1-3
War and Peace

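As I said, I'm a complete novice with R, so please imagine you're talking to a five-year-old. If you need more information to answer this, please let me know.

Here is the code I'm using for everything else (I'm sure it's a bit messy, but I pieced it together from different sources, and it works!)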

### Load the packages the script relies on ###
library(stringr)     # str_sub()
library(data.table)  # data.table(), := and .GRP

### Read in files (pattern is a regular expression, not a shell glob) ###
filenames <- dir(pattern = "\\.csv$")

### Loop through all files, add various columns, then save ###

for (i in seq_along(filenames)) {
  tmp <- read.csv(filenames[i], stringsAsFactors = FALSE)
  ### Add date part of filename to column labelled "date" ###
  tmp$date <- str_sub(filenames[i], start = 13L, end = -5L)
  ### Create new column labelled "Series" ###
  tmp$Series <- ifelse(grepl(": Series", tmp$Programme_Title), "yes", "no")
  ### Create "rank" for Programme_Category ###
  tmp$rank <- sequence(rle(as.character(tmp$Programme_Category))$lengths)
  ### Create new column called "row" to assign numerical label to each group ###
  DT <- data.table(tmp)
  tmp <- DT[, row := .GRP, by = .(Programme_Category)][]
  ### Identify box sets and create new column with "yes" / "no" ###
  tmp$Box_Set <- ifelse(grepl("Box Set", tmp$Programme_Synopsis), "yes", "no")
  ### Remove the data.table which we no longer need ###
  rm(DT)
  ### Write out the new file (row.names = FALSE avoids adding an index column on every run) ###
  write.csv(tmp, filenames[[i]], row.names = FALSE)
}

2 Answers:

Answer 0: (Score: 0)

I don't have your exact data structure, but I've created an example that should work for you:

library(tidyr)
movieName <- c("This is a test", "This is another test: Series 1-5", "This is yet another test")
df <- data.frame(movieName)
df
                         movieName
1                   This is a test
2 This is another test: Series 1-5
3         This is yet another test
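# fill = "right" pads titles that have no ": Series" with NA instead of raising a warning
df <- df %>% separate(movieName, c("Title", "Series"), sep = ": Series", fill = "right")

# Put the "Series" prefix back; ifelse() is vectorised, so no loop is needed
df$Series <- ifelse(is.na(df$Series), "", paste0("Series", df$Series))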
df
                     Title     Series
1           This is a test           
2     This is another test Series 1-5
3 This is yet another test       
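
For comparison, here is a minimal base R sketch of the same idea that needs no packages, run on titles taken from the question. It assumes the series part always sits at the very end of the title and is either a single number or a number range. `split_series` is a hypothetical helper name, not something from the question or from any package.

```r
# Hypothetical helper: split "<title>: Series <n>" or "<title>: Series <n>-<m>"
# into two columns and leave every other title untouched.
split_series <- function(df, col = "Programme_Title") {
  # Assumption: the series part is at the end and is a number or a number range.
  pattern <- "^(.*?)\\s*:\\s*(Series\\s+[0-9]+(-[0-9]+)?)\\s*$"
  titles <- as.character(df[[col]])
  has_series <- grepl(pattern, titles, perl = TRUE)
  # Second capture group is the series part; blank when there is none.
  df$Series_Details <- ifelse(has_series, sub(pattern, "\\2", titles, perl = TRUE), "")
  # First capture group is the bare title; unmatched titles are kept whole.
  df[[col]] <- ifelse(has_series, sub(pattern, "\\1", titles, perl = TRUE), titles)
  df
}

shows <- data.frame(
  Programme_Title = c("Hidden",
                      "Train Surfing Wars: A Matter of Life and Death",
                      "Cuckoo: Series 4",
                      "Mark Gatiss on John Minton: The Lost Man of British Art",
                      "Asian Provocateur: Series 1-2",
                      "War and Peace"),
  stringsAsFactors = FALSE
)

result <- split_series(shows)
print(result)
```

Inside the question's loop this could presumably be called as `tmp <- split_series(tmp)` just before `write.csv()`.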

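Answer 1: (Score: 0)

I tried to capture all the cases you might run into, but you can easily add to this to catch variants not covered by the examples I've provided.

Edit: I added a test case that contains neither ":" nor "Series". It simply produces NA for Series_Details.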

## load library: the main ones used are stringr, dplyr, tidyr, and tibble from the tidyverse, but I would recommend just installing the tidyverse
library(tidyverse)

## example of your data, hard to know all the unique types of data, but this will get you in the right direction
data <- tibble(title = c("X:Series 1-6",
                         "Y: Series 1-2",
                         "Z : Series 1-10",
                         "The Z and Z: 1-3",
                         "XX Series 1-3",
                         "AA AA"))

## Example of the data we want to format, see the different cases covered
print(data)

  title           
  <chr>           
1 X:Series 1-6    
2 Y: Series 1-2   
3 Z : Series 1-10 
4 The Z and Z: 1-3
5 XX Series 1-3
6 AA AA   

## These %>% are called pipes, and used to feed data through a pipeline, very handy and useful.
data_formatted <- data %>%

  ## Need to fix cases where you have Series but no : or vice versa; this keeps everything else the same.
  ## Sounds like you will always have either :, Series, or :Series. If this is different you can easily
  ## change/update this to capture other cases
  mutate(title = case_when(
    str_detect(title,'Series') & !(str_detect(title,':')) ~ str_replace(title,'Series',':Series'),
    !(str_detect(title,'Series')) & (str_detect(title,':')) ~ str_replace(title,':',':Series'),
    TRUE ~ title)) %>% 

  ## first separate the columns based on :
  separate(col = title,into = c("Programme_Title","Series_Details"), sep  = ':') %>% 

  ##This just removes all white space at the ends to clean it up
  mutate(Programme_Title = str_trim(Programme_Title),
         Series_Details = str_trim(Series_Details))

## Output of the data to see how it was formatted
print(data_formatted)

  Programme_Title Series_Details
  <chr>           <chr>         
1 X               Series 1-6    
2 Y               Series 1-2    
3 Z               Series 1-10   
4 The Z and Z     Series 1-3    
5 XX              Series 1-3
6 AA AA           NA
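
One thing to watch for: the second `case_when` branch above rewrites any ":" into ":Series", so a title such as "Bollywood: The World's Biggest Film Industry" would be split as well, which the question wants left alone. Below is a possible stricter variant. It is a sketch under the assumption that only a colon followed by "Series" and a number or number range should trigger the split. The `pattern` regex and the `strict` name are illustrative choices, not part of the answer above.

```r
library(dplyr)
library(stringr)

titles <- tibble(title = c("X:Series 1-6",
                           "Y: Series 1-2",
                           "Z : Series 1-10",
                           "The Z and Z: 1-3",
                           "Bollywood: The World's Biggest Film Industry",
                           "AA AA"))

# Assumption: the series part is at the end and is a number or a number range.
# Spaces around the colon are optional, so "X:Series" and "Z : Series" both match.
pattern <- "^(.*?)\\s*:\\s*(Series\\s+\\d+(?:-\\d+)?)\\s*$"

# str_match() returns a matrix: full match, group 1 (title), group 2 (series).
# Rows that do not match the pattern are all NA.
m <- str_match(titles$title, pattern)

strict <- titles %>%
  transmute(
    # Use the captured title when there is one, otherwise keep the original.
    Programme_Title = coalesce(m[, 2], str_trim(title)),
    Series_Details  = m[, 3]
  )

print(strict)
```

With this variant, rows without a ": Series <number>" suffix keep their full title and get NA in Series_Details, including titles that merely contain a colon.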