Question

I have many .txt files that usually have 5 columns, but some rows have more, for example:

a,b,c,d,e
a,b,c,d,e
a,b,c,d,e
a,b,c,d,e,f,g
a,b,c,d,e

What I want to do is paste together all the columns beyond the fifth. The example above should result in:

a,b,c,d,e
a,b,c,d,e
a,b,c,d,e
a,b,c,d,e f g
a,b,c,d,e

How can I do this in R?

Answer 1

I assume you have read your ".csv" file into R with:

dat <- read.csv(file, header = FALSE, fill = TRUE)

A little test with the data you provided:

x <- "a,b,c,d,e
      a,b,c,d,e
      a,b,c,d,e
      a,b,c,d,e,f,g
      a,b,c,d,e"

dat <- read.csv(text = x, header = FALSE, fill = TRUE)

#           V1 V2 V3 V4 V5 V6 V7
#1           a  b  c  d  e      
#2           a  b  c  d  e      
#3           a  b  c  d  e      
#4           a  b  c  d  e  f  g
#5           a  b  c  d  e

Then this might be one possibility:

from <- 5
dat[, from] <- do.call(paste, dat[from:ncol(dat)])  ## merge and overwrite
dat[, (from+1):ncol(dat)] <- NULL  ## drop

#           V1 V2 V3 V4    V5
#1           a  b  c  d   e  
#2           a  b  c  d   e  
#3           a  b  c  d   e  
#4           a  b  c  d e f g
#5           a  b  c  d   e

My simple approach requires you to know `from` in advance, but it seems you do know it.
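The merge-and-drop steps above could be wrapped in a small helper. This is only a sketch: `merge_tail_cols` is a hypothetical name, not part of the answer or of base R, and the `trimws()` call is an added assumption that the trailing spaces left by empty cells are unwanted.

```r
# Hypothetical helper (illustration only): paste columns from..ncol(dat)
# into column `from`, then drop the columns to its right.
merge_tail_cols <- function(dat, from) {
  if (ncol(dat) > from) {
    # Paste the trailing columns row by row; trimws() removes the padding
    # that empty cells leave at the end of the pasted string.
    dat[[from]] <- trimws(do.call(paste, dat[from:ncol(dat)]))
    dat <- dat[seq_len(from)]
  }
  dat
}

x <- "a,b,c,d,e
a,b,c,d,e
a,b,c,d,e
a,b,c,d,e,f,g
a,b,c,d,e"

dat <- read.csv(text = x, header = FALSE, fill = TRUE)
res <- merge_tail_cols(dat, from = 5)
print(res)
```

The `if` guard means a file with no surplus columns passes through unchanged, which the two-line version does not handle.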

Answer 2

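We can read the file with readLines, split the 'lines' on "," into a list ('lst'), find the minimum of the lengths of its elements ('minLength'), create a logical condition ('i1') marking the elements longer than 'minLength', paste the surplus fields of those elements together ('v1'), and place the results in a vector ('v2') that is empty for every other row.

lines <- readLines("yourfile.txt")
lst <- strsplit(lines, ",")
minLength <- min(lengths(lst))
i1 <- lengths(lst) > minLength
v1 <- sapply(lst[i1], function(x) paste(x[(minLength+1):length(x)], collapse=" "))
# assign by position: ifelse(i1, v1, "") would recycle v1 and misalign it when several rows are long
v2 <- character(length(lst))
v2[i1] <- v1

NOTE: This does not require reading the data and checking how many columns there are. It automatically finds the number of valid columns and pastes the other columns.

After we have created the vector ('v2'), we can read the 'lines' using read.csv with fill = TRUE:

df1 <- read.csv(text = lines, header = FALSE, fill = TRUE)
df1$newCol <- v2

Or we can read the file directly with read.csv and find the first column that contains an NA or "" value. Checking by eye where the first NA or "" starts is difficult when there are 100 columns and 1000 rows (this assumes there are no other NA or "" values in the dataset).

df1 <- read.csv("yourfile.txt", header = FALSE, fill = TRUE)
i1 <- which.max(colSums(df1=="")!=0)
#i1 <- which.max(colSums(is.na(df1))!=0) #if it is NA
transform(df1[seq(i1-1)], newCol= do.call(paste, df1[i1:ncol(df1)]))
#  V1 V2 V3 V4 V5 newCol
#1  a  b  c  d  e
#2  a  b  c  d  e
#3  a  b  c  d  e
#4  a  b  c  d  e    f g
#5  a  b  c  d  e

NOTE: When I first posted, I used do.call(paste.

Another approach is to use count.fields:

i1 <- min(count.fields("yourfile.txt", sep=","))

and then read the dataset with read.csv/read.table as in the approaches above (here 'i1' is the number of regular columns, so the surplus columns start at i1 + 1).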
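To tie the count.fields variant together, here is a rough end-to-end sketch. Several things in it are assumptions rather than part of the answer: a temporary file stands in for "yourfile.txt", the names `n_fields`, `n_keep`, `extra` and `res` are illustrative, and `colClasses = "character"` is added so that empty cells are read as "" rather than NA. `col.names` is sized to the widest line because read.csv otherwise decides the number of columns from the first five lines only.

```r
# A temporary file stands in for "yourfile.txt" (assumption for the demo).
path <- tempfile(fileext = ".txt")
writeLines(c("a,b,c,d,e", "a,b,c,d,e", "a,b,c,d,e",
             "a,b,c,d,e,f,g", "a,b,c,d,e"), path)

# Number of fields on each line; the minimum is taken as the
# number of regular columns.
n_fields <- count.fields(path, sep = ",")
n_keep <- min(n_fields)

# Read everything as text, with enough columns for the widest line.
df1 <- read.csv(path, header = FALSE, fill = TRUE,
                col.names = paste0("V", seq_len(max(n_fields))),
                colClasses = "character")

res <- df1
if (ncol(df1) > n_keep) {
  # Columns beyond the regular ones, pasted row by row into newCol.
  extra <- df1[(n_keep + 1):ncol(df1)]
  res <- transform(df1[seq_len(n_keep)],
                   newCol = trimws(do.call(paste, extra)))
}
print(res)
```

Because count.fields scans the whole file, this also covers files whose first long row appears after line five.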

Answer 3

If you are on a unix-based system, you can preprocess the file before loading it into R (example file ff.txt):

$ paste  -d ',' <(cut -f 1-4 -d ',' ff.txt) <(cut -f 5- -d ',' ff.txt | tr ',' ' ') > ff-mod.txt

This writes the new file ff-mod.txt:

$ cat ff-mod.txt 
a,b,c,d,e
a,b,c,d,e
a,b,c,d,e
a,b,c,d,e f g
a,b,c,d,e

This file can then easily be read into R:

> read.table('ff-mod.txt', sep=',')
  V1 V2 V3 V4    V5
1  a  b  c  d     e
2  a  b  c  d     e
3  a  b  c  d     e
4  a  b  c  d e f g
5  a  b  c  d     e
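For readers who are not on a unix-like system, the same preprocessing idea could be sketched in R itself. This is not part of the answer: `fix_line` and `n_sep` are hypothetical names, and the sample lines are held in a vector here instead of being read from ff.txt with readLines.

```r
# Sample lines; in practice these might come from readLines("ff.txt").
lines <- c("a,b,c,d,e", "a,b,c,d,e", "a,b,c,d,e",
           "a,b,c,d,e,f,g", "a,b,c,d,e")
n_sep <- 4  # number of leading fields kept as separate columns

# Keep the first n_sep fields comma-separated and join the remaining
# fields with spaces, mirroring the cut/tr/paste pipeline.
fix_line <- function(line, n_sep) {
  parts <- strsplit(line, ",", fixed = TRUE)[[1]]
  if (length(parts) <= n_sep + 1) return(line)
  head_part <- paste(parts[seq_len(n_sep)], collapse = ",")
  tail_part <- paste(parts[(n_sep + 1):length(parts)], collapse = " ")
  paste(head_part, tail_part, sep = ",")
}

fixed <- vapply(lines, fix_line, character(1), n_sep = n_sep,
                USE.NAMES = FALSE)
dat <- read.csv(text = fixed, header = FALSE)
print(dat)
```

Like the shell version, this rewrites the text before parsing, so read.csv only ever sees five fields per line.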

Pasting together the columns of a comma-separated .txt file
