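Disclaimer: I have no experience with R at all, so please bear with me!
Context: I have a directory containing a series of .csv files. Each file has 7 columns and roughly 100 rows. I have put together a script that reads in all the files and loops over each one, adding new columns based on various conditions (for example, if a particular column mentions "box set", it creates a new column called "box_set" with "yes" or "no" in each row), and then overwrites the original file. The one thing I cannot work out (and yes, I have Googled high and low) is how to split a column in two based on a particular string. The string always starts with ": Series" but can end with different numbers or number ranges, e.g. "Poldark: Series 4", "The Musketeers: Series 1-3".
I want to split this column (currently called Programme_Title) into two columns (one called Programme_Title and one called Series_Details). Programme_Title would contain everything before the ":", and Series_Details would contain everything from the "S" onwards.
To complicate matters, the Programme_Title column contains many different strings, and not all of them follow the examples above. Some do not contain ": Series" at all, and some contain a ":" that is not followed by "Series".
Because I am terrible at explaining things, here is an example of what the column currently looks like: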
Programme_Title
Hidden
Train Surfing Wars: A Matter of Life and Death
Bollywood: The World's Biggest Film Industry
Cuckoo: Series 4
Mark Gatiss on John Minton: The Lost Man of British Art
Love and Drugs on the Street
Asian Provocateur: Series 1-2
Poldark: Series 4
The Musketeers: Series 1-3
War and Peace
This is what I want it to look like:
Programme_Title                                          Series_Details
Hidden
Train Surfing Wars: A Matter of Life and Death
Bollywood: The World's Biggest Film Industry
Cuckoo                                                   Series 4
Mark Gatiss on John Minton: The Lost Man of British Art
Love and Drugs on the Street
Asian Provocateur                                        Series 1-2
Poldark                                                  Series 4
The Musketeers                                           Series 1-3
War and Peace
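As I said, I am new to R, so please imagine you are talking to a 5-year-old. If you need more information to answer this question, please let me know.
Here is the code I am using to do everything else (I am sure it is a bit messy, but I cobbled it together from different sources, and it works!):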
### Load the packages the script relies on ###
library(stringr)     # str_sub()
library(data.table)  # data.table(), := and .GRP

### Read in files ###
# `pattern` is a regular expression, not a glob, so match a literal ".csv" suffix
filenames <- dir(pattern = "\\.csv$")

### Loop through all files, add various columns, then save ###
for (i in seq_along(filenames)) {
  tmp <- read.csv(filenames[i], stringsAsFactors = FALSE)

  ### Add date part of filename to column labelled "date" ###
  tmp$date <- str_sub(filenames[i], start = 13L, end = -5L)

  ### Create new column labelled "Series" ###
  tmp$Series <- ifelse(grepl(": Series", tmp$Programme_Title), "yes", "no")

  ### Create "rank" for Programme_Category ###
  tmp$rank <- sequence(rle(as.character(tmp$Programme_Category))$lengths)

  ### Create new column called "row" to assign numerical label to each group ###
  DT <- data.table(tmp)
  tmp <- DT[, row := .GRP, by = .(Programme_Category)][]

  ### Identify box sets and create new column with "yes" / "no" ###
  tmp$Box_Set <- ifelse(grepl("Box Set", tmp$Programme_Synopsis), "yes", "no")

  ### Remove the data.table which we no longer need ###
  rm(DT)

  ### Write out the new file (row.names = FALSE avoids adding an extra index column) ###
  write.csv(tmp, filenames[i], row.names = FALSE)
}
Answer 0 (score: 0)
I don't have your exact data structure, but I have created an example you can work from:
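library(tidyr)  # separate(); also re-exports the %>% pipe

movieName <- c("This is a test", "This is another test: Series 1-5", "This is yet another test")
df <- data.frame(movieName, stringsAsFactors = FALSE)
df
#>                          movieName
#> 1                   This is a test
#> 2 This is another test: Series 1-5
#> 3         This is yet another test

# Split on ": Series"; fill = "right" puts NA in Series (without a warning) when the separator is absent
df <- df %>% separate(movieName, c("Title", "Series"), sep = ": Series", fill = "right")

# Restore the "Series" prefix that the split consumed; rows without a series get an empty string
df$Series <- ifelse(is.na(df$Series), "", paste0("Series", df$Series))
df
#>                      Title     Series
#> 1           This is a test
#> 2     This is another test Series 1-5
#> 3 This is yet another test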
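The same idea can also be sketched in base R against the Programme_Title column from the question. This is only an illustration under a few assumptions: the titles below are a subset copied from the question's sample, the variable name `series_pattern` is my own, and the series part is assumed to always look like ": Series ..." at the end of the title. Titles that contain a ":" not followed by "Series" are left untouched.

```r
# Sample titles copied from the question (a subset, for illustration only)
df <- data.frame(
  Programme_Title = c(
    "Hidden",
    "Bollywood: The World's Biggest Film Industry",
    "Cuckoo: Series 4",
    "Asian Provocateur: Series 1-2",
    "War and Peace"
  ),
  stringsAsFactors = FALSE
)

# A colon, optional whitespace, then "Series" and everything up to the end of the title
series_pattern <- ":\\s*(Series.*)$"

has_series <- grepl(series_pattern, df$Programme_Title, perl = TRUE)

# Take the series part first, while the original title is still intact
df$Series_Details <- ifelse(
  has_series,
  sub(paste0("^.*", series_pattern), "\\1", df$Programme_Title, perl = TRUE),
  ""
)

# Then drop the ": Series ..." suffix from the title; other titles are unchanged
df$Programme_Title <- sub(series_pattern, "", df$Programme_Title, perl = TRUE)

df
```

Inside the loop from the question, the same three statements could be applied to `tmp` instead of `df` before the file is written out.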
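Answer 1 (score: 0)
I have tried to capture all the cases you might run into, but you can easily add to this to catch variations not covered by the examples I have provided.
EDIT: I added a test case that contains neither ":" nor "Series". It will just produce NA for the series details.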
## Load libraries: this mainly uses stringr, dplyr, tidyr, and tibble from the tidyverse, but I would recommend just installing the whole tidyverse
library(tidyverse)
## Example of your data; it is hard to know all the unique types of data, but this will get you heading in the right direction
data <- tibble(title = c("X:Series 1-6",
                         "Y: Series 1-2",
                         "Z : Series 1-10",
                         "The Z and Z: 1-3",
                         "XX Series 1-3",
                         "AA AA"))
## Example of the data we want to format; see the different cases covered
print(data)
#>   title
#>   <chr>
#> 1 X:Series 1-6
#> 2 Y: Series 1-2
#> 3 Z : Series 1-10
#> 4 The Z and Z: 1-3
#> 5 XX Series 1-3
#> 6 AA AA
## The %>% operators are called pipes; they feed data through a pipeline, which is very handy.
data_formatted <- data %>%
  ## Fix the cases that have "Series" but no ":" or vice versa, so that every such title contains ":Series".
  ## It sounds like you will always have ":", "Series", or ":Series". If that is not the case, you can easily
  ## change/update this to capture other cases
  mutate(title = case_when(
    str_detect(title, 'Series') & !(str_detect(title, ':')) ~ str_replace(title, 'Series', ':Series'),
    !(str_detect(title, 'Series')) & (str_detect(title, ':')) ~ str_replace(title, ':', ':Series'),
    TRUE ~ title)) %>%
  ## First separate the columns based on ":"; fill = "right" gives NA (without a warning) when there is no ":"
  separate(col = title, into = c("Programme_Title", "Series_Details"), sep = ':', fill = "right") %>%
  ## This just removes all white space at the ends to clean it up
  mutate(Programme_Title = str_trim(Programme_Title),
         Series_Details = str_trim(Series_Details))
## Output of the data to see how it was formatted
print(data_formatted)
#>   Programme_Title Series_Details
#>   <chr>           <chr>
#> 1 X               Series 1-6
#> 2 Y               Series 1-2
#> 3 Z               Series 1-10
#> 4 The Z and Z     Series 1-3
#> 5 XX              Series 1-3
#> 6 AA AA           NA
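One caveat: the pipeline above splits on every ":", so a title from the question such as "Bollywood: The World's Biggest Film Industry" would also be split in two. If those titles should stay intact, a possible variation is to split only where the ":" is followed by "Series". The sketch below assumes that requirement; the sample titles are copied from the question, and `series_pattern` is a name I made up for illustration.

```r
library(dplyr)
library(stringr)
library(tibble)

## A few titles copied from the question
data <- tibble(Programme_Title = c("Hidden",
                                   "Bollywood: The World's Biggest Film Industry",
                                   "Cuckoo: Series 4",
                                   "The Musketeers: Series 1-3"))

## A colon, optional whitespace, then "Series" and the rest of the title
series_pattern <- ":\\s*Series.*$"

data_formatted <- data %>%
  mutate(
    ## Pull out ": Series ..." (NA when absent), then strip the leading colon and spaces
    Series_Details = str_remove(str_extract(Programme_Title, series_pattern), "^:\\s*"),
    ## Remove the same suffix from the title; titles without it are unchanged
    Programme_Title = str_trim(str_remove(Programme_Title, series_pattern))
  )

print(data_formatted)
```

Series_Details is computed first on purpose, because `mutate()` evaluates its arguments in order and the second step overwrites Programme_Title.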