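I have a dataset (as a .csv file) with many columns, one of which contains the "genre" of a TV show. There are several other columns (one for the show title, one for the episode number, one for the synopsis, and so on). I want to create a new column that numbers each entry of "genre" consecutively. For example, the first instance of Documentary should be followed by "1", the second by "2", and so on. Then, when a new genre starts, the numbering should begin again at "1". In case that is unclear, this is what I mean: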
Documentary, 1
Documentary, 2
Documentary, 3
Documentary, 4
Drama, 1
Drama, 2
Drama, 3
Drama, 4
Drama, 5
Sport, 1
Sport, 2
Sport, 3
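If that makes sense: the number of times each genre appears varies. I also need to apply this to hundreds of .csv files, so adding the numbers by hand is not an option!
I was wondering whether anyone could suggest how to go about this? I'm not the most data-savvy person, so simple approaches are welcome! I know a little R and suspect this could be done with a script containing an if/else loop (e.g. if the next field contains the same value as the previous one, add 1, otherwise start again from 1; apologies for the rough syntax, but you get the idea!). I'm visualising this data in Tableau and noticed that they now have Tableau Prep, so perhaps it could be done there? Any solution is welcome!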
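Answer 0 (score: 1)
There are several ways to do this in R. Here is one approach using functions from the tidyverse suite of packages. We first group by genre, then add a column that counts from 1 up to the number of scripts in that genre. I've provided two options for how the new column could look, depending on what you need.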
library(tidyverse)
# Fake data
set.seed(2)
dat = data.frame(genre = sample(c("Drama", "Comedy", "Sport", "Documentary"), 20, replace=TRUE))
# Add columns to number scripts within each genre
dat = dat %>%
  group_by(genre) %>%
  mutate(count = 1:n(),
         count2 = paste0(genre, ", ", 1:n()))
dat
         genre count         count2
 1       Drama     1       Drama, 1
 2       Sport     1       Sport, 1
 3       Sport     2       Sport, 2
 4       Drama     2       Drama, 2
 5 Documentary     1 Documentary, 1
 6 Documentary     2 Documentary, 2
 7       Drama     3       Drama, 3
 8 Documentary     3 Documentary, 3
 9      Comedy     1      Comedy, 1
10       Sport     3       Sport, 3
11       Sport     4       Sport, 4
12       Drama     4       Drama, 4
13 Documentary     4 Documentary, 4
14       Drama     5       Drama, 5
15      Comedy     2      Comedy, 2
16 Documentary     5 Documentary, 5
17 Documentary     6 Documentary, 6
18       Drama     6       Drama, 6
19      Comedy     3      Comedy, 3
20       Drama     7       Drama, 7
If you want the data sorted, you can do that as well, for example:
dat %>% arrange(genre, count)
         genre count         count2
 1      Comedy     1      Comedy, 1
 2      Comedy     2      Comedy, 2
 3      Comedy     3      Comedy, 3
 4 Documentary     1 Documentary, 1
 5 Documentary     2 Documentary, 2
 6 Documentary     3 Documentary, 3
 7 Documentary     4 Documentary, 4
 8 Documentary     5 Documentary, 5
 9 Documentary     6 Documentary, 6
10       Drama     1       Drama, 1
11       Drama     2       Drama, 2
12       Drama     3       Drama, 3
13       Drama     4       Drama, 4
14       Drama     5       Drama, 5
15       Drama     6       Drama, 6
16       Drama     7       Drama, 7
17       Sport     1       Sport, 1
18       Sport     2       Sport, 2
19       Sport     3       Sport, 3
20       Sport     4       Sport, 4
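If you would rather not load any packages, the same per-genre counter can probably be built in base R. This is only a sketch: the small `dat` below is made up for illustration, and it assumes the column is called `genre` as in the example above.

```r
# Made-up example data (not the asker's real file)
dat <- data.frame(genre = c("Drama", "Sport", "Sport", "Drama", "Documentary"),
                  stringsAsFactors = FALSE)

# ave() runs seq_along() within each genre and keeps the original row order
dat$count <- ave(seq_along(dat$genre), dat$genre, FUN = seq_along)

# Optional combined label, in the "Genre, n" style from the question
dat$count2 <- paste0(dat$genre, ", ", dat$count)
dat
```

Like the grouped version, this counts every occurrence of a genre across the whole file, whether or not the rows are sorted by genre.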
Answer 1 (score: 1)
library(dplyr)
library(tidyr)
df <- data.frame(genre = c("Documentary", "Documentary", "Documentary", "Sport", "Sport", "Drama"), rating = c(2,2,4,4,6,6))
df %>% group_by(genre) %>% mutate(id = row_number()) %>% unite(genre_number, c("genre", "id"), sep = " ")
# A tibble: 6 x 2
  genre_number  rating
  <chr>          <dbl>
1 Documentary 1      2
2 Documentary 2      2
3 Documentary 3      4
4 Sport 1            4
5 Sport 2            6
6 Drama 1            6
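Note that grouping by genre counts every occurrence of a genre in the file. If instead you want what the question's if/else idea describes (restart at 1 whenever the genre differs from the previous row, even if that genre appeared earlier), a run-length approach might be closer. A minimal base R sketch, using a made-up `genre` vector:

```r
# Made-up vector in which "Documentary" comes back after "Drama"
genre <- c("Documentary", "Documentary", "Drama", "Drama", "Drama", "Documentary")

# rle() finds runs of identical consecutive values;
# sequence() then counts 1..length within each run
runs <- rle(genre)
run_count <- sequence(runs$lengths)

data.frame(genre, run_count)
```

Here the last Documentary row gets 1 rather than 3, because it starts a new run. When the file is already sorted by genre, both approaches give the same numbers.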
Edit: To process files in batch, you can turn any of these approaches into a function and apply it to a list of files.
library(dplyr)
library(tidyr)
number_genres <- function(x) {
  x %>%
    group_by(genre) %>%
    mutate(id = row_number()) %>%
    unite(genre_number, c("genre", "id"), sep = " ")
}
dir <- "C:/Documents/test" # location of your .csv files
# full.names = TRUE so read.csv() can find the files from any working directory;
# pattern is a regular expression, so "\\.csv$" matches names ending in .csv
filenames <- list.files(path = dir, pattern = "\\.csv$", full.names = TRUE)
data_list <- lapply(filenames, read.csv) # reads your files
names(data_list) <- basename(filenames) # names your list with the respective csv names
numbered <- lapply(data_list, number_genres) # apply your function to your data_list
# writes each result to a .csv of the same name in the current working directory
# (this overwrites the originals if the working directory is dir)
invisible(lapply(seq_along(numbered), function(i) write.csv(numbered[[i]],
                                                            file = names(numbered)[i],
                                                            row.names = FALSE)))
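Because the output files reuse the input file names, they can overwrite the originals if they end up in the same folder. One way around that is to write to a separate output folder. The sketch below is an illustration only: the folders under `tempdir()`, the two tiny input files, and the helper `add_genre_id()` (a base R stand-in for `number_genres()`) are all made up so that it runs on its own.

```r
# Hypothetical demo folders under tempdir(); replace with your real paths
in_dir  <- file.path(tempdir(), "genre_in")
out_dir <- file.path(tempdir(), "genre_out")
dir.create(in_dir, showWarnings = FALSE)
dir.create(out_dir, showWarnings = FALSE)

# Two tiny made-up input files so the sketch is self-contained
write.csv(data.frame(genre = c("Drama", "Drama", "Sport")),
          file.path(in_dir, "a.csv"), row.names = FALSE)
write.csv(data.frame(genre = c("Sport", "Documentary", "Sport")),
          file.path(in_dir, "b.csv"), row.names = FALSE)

# Base R stand-in for number_genres(): adds a per-genre counter column `id`
add_genre_id <- function(x) {
  x$id <- ave(seq_along(x$genre), x$genre, FUN = seq_along)
  x
}

# Read each input file, number it, and write it under the same name in out_dir
files <- list.files(in_dir, pattern = "\\.csv$", full.names = TRUE)
for (f in files) {
  out <- add_genre_id(read.csv(f, stringsAsFactors = FALSE))
  write.csv(out, file.path(out_dir, basename(f)), row.names = FALSE)
}
```

The originals in `in_dir` stay untouched, so you can rerun the script safely if the numbering needs to change.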