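I'm using MALLET for topic analysis. It writes its results to a text file ("topics.txt") with several thousand rows and a hundred or so columns, where each row consists of tab-separated variables like this: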
Num1 text1 topic1 proportion1 topic2 proportion2 topic3 proportion3, etc.
Num2 text2 topic1 proportion1 topic2 proportion2 topic3 proportion3, etc.
Num3 text3 topic1 proportion1 topic2 proportion2 topic3 proportion3, etc.
Here is a snippet of the actual data:
> dat[1:5,1:10]
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 0 10.txt 27 0.4560785 23 0.3040853 20 0.1315621 21 0.03632624
2 1 1001.txt 20 0.2660085 12 0.2099153 8 0.1699586 13 0.16922928
3 2 1002.txt 16 0.3341721 2 0.1747023 10 0.1360454 12 0.07507119
4 3 1003.txt 12 0.5366148 8 0.2255179 18 0.1388561 0 0.01867091
5 4 1005.txt 16 0.2363206 0 0.2214441 24 0.1914769 7 0.17760521
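I'm trying to use R to convert this output into a data table where the topics are the column headers and, for each value of 'text', each topic column holds the 'proportion' value found immediately to the right of the corresponding 'topic' variable. Like this: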
topic1 topic2 topic3
text1 proportion1 proportion2 proportion3
text2 proportion1 proportion2 proportion3
Or, using the data snippet above, like this:
0 2 7 8 10 12 13 16 18 20 21 23 24 27
10.txt 0 0 0 0 0 0 0 0 0 0.1315621 0.03632624 0.3040853 0 0.4560785
1001.txt 0 0 0 0.1699586 0 0.2099153 0.1692292 0 0 0.2660085 0 0 0 0
1002.txt 0 0.1747023 0 0 0.1360454 0.0750711 0 0.3341721 0 0 0 0 0 0
1003.txt 0.0186709 0 0 0.2255179 0 0.5366148 0 0 0.138856 0 0 0 0 0
1005.txt 0.2214441 0 0.1776052 0 0 0 0 0.2363206 0 0 0 0 0.1914769 0
Here is the R code I have for this job. A friend sent it to me, but it doesn't work for me (and I don't know enough about it to fix it myself):
##########################################
dat<-read.table("topics.txt", header=F, sep="\t")
datnames<-subset(dat, select=2)
dat2<-subset(dat, select=3:length(dat))
y <- data.frame(topic=character(0),proportion=character(0),text=character(0))
for(i in seq(1, length(dat2), 2)){
z<-i+1
x<-dat2[,i:z]
x<-cbind(x, datnames)
colnames(x)<-c("topic","proportion", "text")
y<-rbind(y, x)
}
# Right at this step at the end of the block
# I get this message that may indicate the problem:
# Error in c(in c("topic", "proportion", "text") : unused argument(s) ("text")
y[is.na(y)] <- 0
xdat<-xtabs(proportion ~ text+topic, data=y)
write.table(xdat, file="topicMatrix.txt", sep="\t", eol = "\n", quote=TRUE, col.names=TRUE, row.names=TRUE)
##########################################
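I'd be very grateful for any suggestions on how to get this code working. My question may also be related to this one and this one, but I don't yet have the skills to put the answers to those questions to use right away.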
Answer 0 (score: 4)
Here is one way to solve the problem:
dat <-read.table(as.is = TRUE, header = FALSE, textConnection(
"Num1 text1 topic1 proportion1 topic2 proportion2 topic3 proportion3
Num2 text2 topic1 proportion1 topic2 proportion2 topic3 proportion3
Num3 text3 topic1 proportion1 topic2 proportion2 topic3 proportion3"))
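NTOPICS <- 3
# column names: num, text, topic1, proportion1, topic2, proportion2, ...
nam <- c('num', 'text',
paste(c('topic', 'proportion'), rep(1:NTOPICS, each = 2), sep = ""))
# wide -> long: one row per (document, topic slot)
dat_l <- reshape(setNames(dat, nam), varying = 3:length(nam), direction = 'long',
sep = "")
# long -> wide: one column per topic (the argument is value.var, not value_var)
reshape2::dcast(dat_l, num + text ~ topic, value.var = 'proportion')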
num text topic1 topic2 topic3
1 Num1 text1 proportion1 proportion2 proportion3
2 Num2 text2 proportion1 proportion2 proportion3
3 Num3 text3 proportion1 proportion2 proportion3
EDIT. This works whether the proportions are text or numeric. You can also change NTOPICS to match the number of topics you have.
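As a rough sketch of adjusting NTOPICS, the count can be derived from the number of columns instead of being typed in by hand. The rows below are the sample from the question. The `extra` column is a made-up stand-in for the trailing empty column that real MALLET output reportedly carries, and dropping all-NA columns is an assumption about how you would want to handle it.

```r
# Sample rows copied from the question: doc number, file name,
# then topic/proportion pairs.
dat <- read.table(header = FALSE, stringsAsFactors = FALSE, text = "
0 10.txt 27 0.4560785 23 0.3040853 20 0.1315621 21 0.03632624
1 1001.txt 20 0.2660085 12 0.2099153 8 0.1699586 13 0.16922928
2 1002.txt 16 0.3341721 2 0.1747023 10 0.1360454 12 0.07507119
3 1003.txt 12 0.5366148 8 0.2255179 18 0.1388561 0 0.01867091
4 1005.txt 16 0.2363206 0 0.2214441 24 0.1914769 7 0.17760521")

# Assumption: simulate a trailing all-NA column (hypothetical name "extra"),
# then drop every column that contains nothing but NA.
dat$extra <- NA
dat <- dat[, colSums(!is.na(dat)) > 0, drop = FALSE]

# Two id columns, then one (topic, proportion) pair per topic slot.
NTOPICS <- (ncol(dat) - 2) / 2
stopifnot(NTOPICS == round(NTOPICS))  # an odd column count means a bad parse

nam <- c("num", "text",
         paste0(c("topic", "proportion"), rep(seq_len(NTOPICS), each = 2)))

# wide -> long: one row per (document, topic slot)
dat_l <- reshape(setNames(dat, nam), varying = 3:length(nam),
                 direction = "long", sep = "")
str(dat_l[, c("text", "topic", "proportion")])
```

The long table can then be cast to wide format with the `dcast` call shown above.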
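Answer 1 (score: 2)
You could convert it to long format, but going any further would require the actual data. Edited after the data was provided. I'm still not sure about the overall structure of the MALLET output, but this at least demonstrates the R functions. This approach has the "feature" that if a topic occurs more than once for a document, the proportions are summed. Depending on the data layout, that may be an advantage.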
dat <-read.table(textConnection(" V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 0 10.txt 27 0.4560785 23 0.3040853 20 0.1315621 21 0.03632624
2 1 1001.txt 20 0.2660085 12 0.2099153 8 0.1699586 13 0.16922928
3 2 1002.txt 16 0.3341721 2 0.1747023 10 0.1360454 12 0.07507119
4 3 1003.txt 12 0.5366148 8 0.2255179 18 0.1388561 0 0.01867091
5 4 1005.txt 16 0.2363206 0 0.2214441 24 0.1914769 7 0.17760521
"),
header=TRUE)
ldat <- reshape(dat, idvar=1:2, varying=list(topics=c("V3", "V5", "V7", "V9"),
props=c("V4", "V6", "V8", "V10")),
direction="long")
####------------------####
> ldat
V1 V2 time V3 V4
0.10.txt.1 0 10.txt 1 27 0.45607850
1.1001.txt.1 1 1001.txt 1 20 0.26600850
2.1002.txt.1 2 1002.txt 1 16 0.33417210
3.1003.txt.1 3 1003.txt 1 12 0.53661480
4.1005.txt.1 4 1005.txt 1 16 0.23632060
0.10.txt.2 0 10.txt 2 23 0.30408530
1.1001.txt.2 1 1001.txt 2 12 0.20991530
2.1002.txt.2 2 1002.txt 2 2 0.17470230
3.1003.txt.2 3 1003.txt 2 8 0.22551790
4.1005.txt.2 4 1005.txt 2 0 0.22144410
0.10.txt.3 0 10.txt 3 20 0.13156210
1.1001.txt.3 1 1001.txt 3 8 0.16995860
2.1002.txt.3 2 1002.txt 3 10 0.13604540
3.1003.txt.3 3 1003.txt 3 18 0.13885610
4.1005.txt.3 4 1005.txt 3 24 0.19147690
0.10.txt.4 0 10.txt 4 21 0.03632624
1.1001.txt.4 1 1001.txt 4 13 0.16922928
2.1002.txt.4 2 1002.txt 4 12 0.07507119
3.1003.txt.4 3 1003.txt 4 0 0.01867091
4.1005.txt.4 4 1005.txt 4 7 0.17760521
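Now I can show you how to use xtabs(), since those "proportions" are "numeric". Something like this may end up being what you want. I'm surprised that the topics are also integers, but perhaps there is a mapping from topic number to topic name?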
> xtabs(V4 ~ V3 + V2, data=ldat)
V2
V3 10.txt 1001.txt 1002.txt 1003.txt 1005.txt
0 0.00000000 0.00000000 0.00000000 0.01867091 0.22144410
2 0.00000000 0.00000000 0.17470230 0.00000000 0.00000000
7 0.00000000 0.00000000 0.00000000 0.00000000 0.17760521
8 0.00000000 0.16995860 0.00000000 0.22551790 0.00000000
10 0.00000000 0.00000000 0.13604540 0.00000000 0.00000000
12 0.00000000 0.20991530 0.07507119 0.53661480 0.00000000
13 0.00000000 0.16922928 0.00000000 0.00000000 0.00000000
16 0.00000000 0.00000000 0.33417210 0.00000000 0.23632060
18 0.00000000 0.00000000 0.00000000 0.13885610 0.00000000
20 0.13156210 0.26600850 0.00000000 0.00000000 0.00000000
21 0.03632624 0.00000000 0.00000000 0.00000000 0.00000000
23 0.30408530 0.00000000 0.00000000 0.00000000 0.00000000
24 0.00000000 0.00000000 0.00000000 0.00000000 0.19147690
27 0.45607850 0.00000000 0.00000000 0.00000000 0.00000000
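The question asked for documents as rows and topics as columns, which is the transpose of the table above. A minimal sketch, assuming the long data has the same three columns as `ldat` (rebuilt here by hand from the printed output so the example runs on its own): swap the two terms on the right-hand side of the formula, then convert the result to a plain data frame before writing it out. The output path is a placeholder in a temporary directory.

```r
# Long-format data, copied from the printed ldat above (V2 = document,
# V3 = topic number, V4 = proportion).
docs <- c("10.txt", "1001.txt", "1002.txt", "1003.txt", "1005.txt")
ldat <- data.frame(
  V2 = rep(docs, times = 4),
  V3 = c(27, 20, 16, 12, 16,  23, 12, 2, 8, 0,
         20, 8, 10, 18, 24,   21, 13, 12, 0, 7),
  V4 = c(0.4560785, 0.2660085, 0.3341721, 0.5366148, 0.2363206,
         0.3040853, 0.2099153, 0.1747023, 0.2255179, 0.2214441,
         0.1315621, 0.1699586, 0.1360454, 0.1388561, 0.1914769,
         0.03632624, 0.16922928, 0.07507119, 0.01867091, 0.17760521),
  stringsAsFactors = FALSE
)

# Documents (V2) as rows, topics (V3) as columns; missing pairs become 0.
tab <- xtabs(V4 ~ V2 + V3, data = ldat)

# Keep the 2-D layout when converting to a data frame.
wide <- as.data.frame.matrix(tab)

# Placeholder output location; col.names = NA leaves the corner cell blank.
out_file <- file.path(tempdir(), "topicMatrix.txt")
write.table(wide, file = out_file, sep = "\t", quote = FALSE, col.names = NA)
```

As noted above, xtabs() sums the proportions if the same topic appears twice for one document.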
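Answer 2 (score: 2)
Coming back to this problem, I found that the reshape function was too demanding on memory, so I used a data.table approach instead. It takes a few more steps, but it is faster and much less memory-intensive.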
dat <- read.table(text = "V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 0 10.txt 27 0.4560785 23 0.3040853 20 0.1315621 21 0.03632624
2 1 1001.txt 20 0.2660085 12 0.2099153 8 0.1699586 13 0.16922928
3 2 1002.txt 16 0.3341721 2 0.1747023 10 0.1360454 12 0.07507119
4 3 1003.txt 12 0.5366148 8 0.2255179 18 0.1388561 0 0.01867091
5 4 1005.txt 16 0.2363206 0 0.2214441 24 0.1914769 7 0.17760521")
dat$V11 <- rep(NA, 5) # my real data has this extra unwanted col
library(data.table) # must be loaded before the first data.table() call
dat <- data.table(dat)
# get document number
docnum <- dat$V1
# get text number
txt <- dat$V2
# remove doc num and text num so we just have topic and props
dat1 <- dat[ ,c("V1","V2", paste0("V", ncol(dat))) := NULL]
# get topic numbers
n <- ncol(dat1)
tops <- apply(dat1, 1, function(i) i[seq(1, n, 2)])
# get props
props <- apply(dat1, 1, function(i) i[seq(2, n, 2)])
# put topics and props together
tp <- lapply(1:ncol(tops), function(i) data.frame(tops[,i], props[,i]))
names(tp) <- txt
# make into long table
dt <- data.table::rbindlist(tp)
dt$doc <- unlist(lapply(txt, function(i) rep(i, ncol(dat1)/2)))
dt$docnum <- unlist(lapply(docnum, function(i) rep(i, ncol(dat1)/2)))
# reshape to wide
library(data.table)
setkey(dt, tops...i., doc)
out <- dt[CJ(unique(tops...i.), unique(doc))][, as.list(props...i.), by=tops...i.]
setnames(out, c("topic", as.character(txt)))
# transpose to have table of docs (rows) and columns (topics)
tout <- data.table(t(out))
setnames(tout, unname(as.character(tout[1,])))
tout <- tout[-1,]
row.names(tout) <- txt
# replace NA with zero
tout[is.na(tout)] <- 0
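Here is the output, with documents as rows and topics as columns. The document names are stored in the rownames, which are not printed but are available for later use.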
tout
0 2 7 8 10 12 13 16 18
1: 0.00000000 0.0000000 0.0000000 0.0000000 0.0000000 0.00000000 0.0000000 0.0000000 0.0000000
2: 0.00000000 0.0000000 0.0000000 0.1699586 0.0000000 0.20991530 0.1692293 0.0000000 0.0000000
3: 0.00000000 0.1747023 0.0000000 0.0000000 0.1360454 0.07507119 0.0000000 0.3341721 0.0000000
4: 0.01867091 0.0000000 0.0000000 0.2255179 0.0000000 0.53661480 0.0000000 0.0000000 0.1388561
5: 0.22144410 0.0000000 0.1776052 0.0000000 0.0000000 0.00000000 0.0000000 0.2363206 0.0000000
20 21 23 24 27
1: 0.1315621 0.03632624 0.3040853 0.0000000 0.4560785
2: 0.2660085 0.00000000 0.0000000 0.0000000 0.0000000
3: 0.0000000 0.00000000 0.0000000 0.0000000 0.0000000
4: 0.0000000 0.00000000 0.0000000 0.0000000 0.0000000
5: 0.0000000 0.00000000 0.0000000 0.1914769 0.0000000
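If installing data.table is not an option, a similarly memory-light result can probably be had in base R by filling a zero matrix through matrix indexing. This is a different technique from the data.table steps above, not a rewrite of them, and it is only a sketch under two assumptions: the columns after the first two strictly alternate topic, proportion, and each topic appears at most once per document (a repeated topic would be overwritten rather than summed).

```r
# Sample rows copied from the question.
dat <- read.table(header = FALSE, stringsAsFactors = FALSE, text = "
0 10.txt 27 0.4560785 23 0.3040853 20 0.1315621 21 0.03632624
1 1001.txt 20 0.2660085 12 0.2099153 8 0.1699586 13 0.16922928
2 1002.txt 16 0.3341721 2 0.1747023 10 0.1360454 12 0.07507119
3 1003.txt 12 0.5366148 8 0.2255179 18 0.1388561 0 0.01867091
4 1005.txt 16 0.2363206 0 0.2214441 24 0.1914769 7 0.17760521")

docs  <- dat[[2]]
pairs <- as.matrix(dat[, -(1:2)])  # numeric matrix of topic/proportion pairs

# Odd columns hold topic numbers, even columns hold proportions.
topics <- pairs[, seq(1, ncol(pairs), by = 2), drop = FALSE]
props  <- pairs[, seq(2, ncol(pairs), by = 2), drop = FALSE]

# One output column per distinct topic number, in increasing order.
topic_ids <- sort(unique(as.vector(topics)))
m <- matrix(0, nrow = nrow(dat), ncol = length(topic_ids),
            dimnames = list(docs, as.character(topic_ids)))

# Two-column index matrix: (row = document, column = position of the topic).
idx <- cbind(as.vector(row(topics)), match(as.vector(topics), topic_ids))
m[idx] <- as.vector(props)

m["10.txt", c("20", "21", "23", "27")]
```

Unlike the data.table result, the document names here are real row names of the matrix, so they survive printing and `write.table()`.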