I have a CSV file laid out as follows:
Year1
Award1,Winner1,Winner2,Winner3...
Award2,Winner4,Winner5,Winner6...
...
Year2
Award1,Winner7,Winner8,Winner9...
How can I rearrange this data into the following format, where the first line is a header?
Year,AwardType,Winner
Year1,Award1,Winner1
Year1,Award1,Winner2
...
Year1,Award2,Winner6
...
Year2,Award1,Winner7
...
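I plan to do some relatively simple analysis in R, and I think the desired layout would make the data easier to work with. If that is not the case, I am open to other suggestions.
Thanks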
Answer 0 (score: 3)
Here is my solution in R. It starts with some mock data; your real data would be in a file.
mockfile <-
"Year1
Award1,Winner1,Winner2,Winner3
Award2,Winner4,Winner5,Winner6
Award3,Winner7,Winner8,Winner9
Year2
Award1,Winner7,Winner8,Winner9
Award2,Winner12,Winner13,Winner14
Award3,Winner15,Winner16,Winner17"
In the rest of the code, textConnection(mockfile) would be replaced by the file name in your case.
entries <- count.fields(textConnection(mockfile), sep=",")
blockstart <- which(entries==1)
blocklength <- diff(c(blockstart, length(entries)+1))-1
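This finds the lines that contain only a single field, which mark the start of each block, and also computes the length of each block. If all blocks have the same length, these steps can be simplified considerably.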
con <- textConnection(mockfile)
# skip any lines before the first single-field (year) line
readLines(con, n=blockstart[1]-1)
blocks <- list()
# iterate over blocks
for (i in seq_along(blockstart)) {
  # read the single line; that is the year
  Year <- readLines(con, n=1L)
  # feed the block part to read.csv
  rest <- read.csv(text=readLines(con, blocklength[i]), header=FALSE)
  rest$Year <- Year
  blocks[[i]] <- rest
}
# release the connection once all blocks are read
close(con)
# bind all the blocks together
full <- do.call(rbind, blocks)
# rename the award column
names(full)[1] <- "AwardType"
This gives a data frame that looks like
> full
  AwardType       V2       V3       V4  Year
1    Award1  Winner1  Winner2  Winner3 Year1
2    Award2  Winner4  Winner5  Winner6 Year1
3    Award3  Winner7  Winner8  Winner9 Year1
4    Award1  Winner7  Winner8  Winner9 Year2
5    Award2 Winner12 Winner13 Winner14 Year2
6    Award3 Winner15 Winner16 Winner17 Year2
To reshape it the way you want, I find the reshape2 package the easiest.
library("reshape2")
melt(full, id.vars=c("Year","AwardType"))
which gives
> melt(full, id.vars=c("Year","AwardType"))
    Year AwardType variable    value
1  Year1    Award1       V2  Winner1
2  Year1    Award2       V2  Winner4
3  Year1    Award3       V2  Winner7
4  Year2    Award1       V2  Winner7
5  Year2    Award2       V2 Winner12
6  Year2    Award3       V2 Winner15
7  Year1    Award1       V3  Winner2
8  Year1    Award2       V3  Winner5
9  Year1    Award3       V3  Winner8
10 Year2    Award1       V3  Winner8
11 Year2    Award2       V3 Winner13
12 Year2    Award3       V3 Winner16
13 Year1    Award1       V4  Winner3
14 Year1    Award2       V4  Winner6
15 Year1    Award3       V4  Winner9
16 Year2    Award1       V4  Winner9
17 Year2    Award2       V4 Winner14
18 Year2    Award3       V4 Winner17
If you really don't want it, you can drop the variable column.
Answer 1 (score: 1)
Here is a solution in R.
d <- read.table("tmp.csv")$V1
result <- list()
year <- "Unknown"
for( line in d ) {
  if( grepl(",", line) ) {
    # a line with commas is an award followed by its winners
    line <- strsplit(line, ",")[[1]]
    line <- data.frame( year = year, award=line[1], winner=line[-1] )
    result <- append( result, list(line) )
  } else {
    # a line without commas starts a new year
    year <- line
  }
}
result <- do.call(rbind, result)
Answer 2 (score: -1)
If you don't mind using Java, you could use something like this
javac Converter.java
java Converter > newdata.csv
Here is the code in Converter.java
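import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;

public class Converter {
    public static void main(String[] args) throws FileNotFoundException {
        File file = new File("data.csv");
        // try-with-resources closes the scanner when reading is done
        try (Scanner scan = new Scanner(file)) {
            String year = null;
            System.out.println("Year,AwardType,Winner");
            while (scan.hasNextLine()) {
                String line = scan.nextLine();
                // treats any 4-character line as a year (e.g. "2012");
                // the placeholder "Year1" from the question would not match
                if (line.length() == 4) {
                    year = line;
                } else {
                    // first field is the award, the rest are winners
                    String[] awardPlusWinners = line.split(",");
                    for (int i = 1; i < awardPlusWinners.length; i++) {
                        System.out.println(year + "," + awardPlusWinners[0] + "," + awardPlusWinners[i]);
                    }
                }
            }
        }
    }
}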
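The Java code above recognises a year line by its length being exactly 4. If your year lines are not always four characters long, one possible variant is to treat any line without a comma as a year, as the R loop in the previous answer does. The following is only a sketch under that assumption: the class name AwardReshaper and the method reshape are made up for illustration, blank lines are skipped, quoted fields containing commas are not handled, and an award line with no winners would be mistaken for a year.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical class name; not part of the original answer.
class AwardReshaper {

    // Turns the block layout (a year line followed by "Award,Winner,..." lines)
    // into one "Year,AwardType,Winner" row per winner, with a header row first.
    static List<String> reshape(List<String> lines) {
        List<String> rows = new ArrayList<>();
        rows.add("Year,AwardType,Winner");
        String year = "Unknown"; // used if award lines appear before any year line
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue; // skip blank lines
            }
            if (!line.contains(",")) {
                year = line; // assumption: a line without a comma starts a new year block
                continue;
            }
            // first field is the award, the remaining fields are winners
            String[] fields = line.split(",");
            for (int i = 1; i < fields.length; i++) {
                rows.add(year + "," + fields[0] + "," + fields[i]);
            }
        }
        return rows;
    }

    public static void main(String[] args) throws IOException {
        // Assumes the input file is named data.csv, as in the answer above.
        for (String row : reshape(Files.readAllLines(Paths.get("data.csv")))) {
            System.out.println(row);
        }
    }
}
```

It can be compiled and run the same way as Converter, for example with javac AwardReshaper.java followed by java AwardReshaper > newdata.csv. Keeping the reshaping in a method that takes and returns lists makes it easy to check on a few sample lines before pointing it at the real file.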