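I'm trying to read a huge csv file into R, but I'm running into trouble because the elements of the column that is supposed to hold strings are not delimited by quotes, and every newline inside them starts a new row. My data is delimited by ~.
For example, my data looks something like this: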
a ~ b ~ c ~ d ~ e
1 ~ name1 ~ This is a paragraph.
This is a second paragraph.
~ num1 ~ num2 ~
2 ~ name2 ~ This is an new set of paragraph.
~ num1 ~ num2 ~
I'd like to get something like this:
a | b     | c                                                | d    | e    |
__|_______|__________________________________________________|______|______|
1 | name1 | This is a paragraph. This is a second paragraph. | num1 | num2 |
2 | name2 | This is a new set of paragraph.                  | num1 | num2 |
But I end up with something ugly like this:
a                          | b     | c                               | d    | e    |
___________________________|_______|_________________________________|______|______|
1                          | name1 | This is a paragraph.            |      |      |
This is a second paragraph |       |                                 |      |      |
                           | num1  | num2                            |      |      |
2                          | name2 | This is a new set of paragraph. | num1 | num2 |
I tried setting allowEscapes = TRUE in read.csv, but that didn't do the trick. My call currently looks like this:
read.csv(filename, header = TRUE, sep = '~', stringsAsFactors = FALSE, fileEncoding = "latin1", quote = "", strip.white = TRUE)
My next idea was to insert a quote after every ~, but I was hoping to see whether there is a better way.
Any help would be greatly appreciated.
Answer 0 (score: 3)
For example:
ll = readLines(textConnection('a ~ b ~ c ~ d ~ e
1 ~ name1 ~ This is a paragraph.
This is a second paragraph.
~ num1 ~ num2 ~
2 ~ name2 ~ This is an new set of paragraph.
~ num1 ~ num2 ~'))
## each record begins with a number followed by a space
## I use this pattern to separate the records
llines <- split(ll[-1],cumsum(grepl('^[0-9]+ ',ll[-1])))
## prepend the header to the split and concatenated lines
read.table(text=unlist(c(ll[1],lapply(llines,paste,collapse=' '))),
           sep='~',header=TRUE)
a b c d e
1 name1 This is a paragraph. This is a second paragraph. num1 num2 NA
2 name2 This is an new set of paragraph. num1 num2 NA
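Answer 1 (score: 2)
Here's an approach in R that relies on (1) ~ being a true delimiter that never appears inside any of the paragraphs, and (2) ~ appearing at the end of every record.
But first, some sample data (so that others can reproduce your problem too).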
cat("a ~ b ~ c ~ d ~ e",
"1 ~ name1 ~ This is a paragraph.",
"",
"This is a second paragraph.",
"",
"~ num1 ~ num2 ~",
"",
"2 ~ name2 ~ This is an new set of paragraph.",
"",
"~ num1 ~ num2 ~", sep = "\n", file = "test.txt")
We'll start by reading the data in with readLines. We'll also add a ~ to the end of the header line.
x <- readLines("test.txt")
x[1] <- paste(x[1], "~") ## Add a ~ at the end of the first line
Now, we'll paste everything together into one nice long string.
y <- paste(x, collapse = " ")
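Use scan to quickly "read" the data in again, but instead of the file argument, we'll use the text argument and refer to the "y" object we just created. Because the last line ends with a ~, there will be an extra "" at the end, which we'll remove before continuing.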
z <- scan(text = y, what = character(), sep = "~", strip.white = TRUE)
# Read 16 items
z <- z[-length(z)]
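Since we now have a character vector, we can easily convert it to a matrix and then to a data.frame. We know the colnames are the first 5 values, so we'll leave them out when creating the matrix and add them back as the names of the data.frame.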
df <- setNames(data.frame(
matrix(z[6:length(z)], ncol = 5, byrow = TRUE)), z[1:5])
df
# a b c d e
# 1 1 name1 This is a paragraph. This is a second paragraph. num1 num2
# 2 2 name2 This is an new set of paragraph. num1 num2
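For readers who want to see the flatten-and-reshape idea outside R, here is a minimal Python sketch of the same steps. The helper name reshape_tilde_text is hypothetical, and the sketch leans on the two assumptions stated at the top of this answer: ~ never occurs inside a field, and every record (but not the header) ends with a ~. Unlike the R code, it derives the column count from the header instead of hard-coding 5.

```python
def reshape_tilde_text(text, sep="~"):
    """Split the whole text on `sep` and cut the fields into rows.

    Hypothetical helper. Assumes `sep` never occurs inside a field and
    that every record, but not the header, ends with a trailing `sep`.
    """
    lines = text.splitlines()
    ncol = len(lines[0].split(sep))  # number of columns, from the header
    # Give the header the same trailing separator the records have
    lines[0] = lines[0] + " " + sep
    # One long string, then one flat list of whitespace-normalized fields
    fields = [" ".join(f.split()) for f in " ".join(lines).split(sep)]
    # The final separator leaves one empty field at the end
    if fields and fields[-1] == "":
        fields.pop()
    header, values = fields[:ncol], fields[ncol:]
    if len(values) % ncol != 0:
        raise ValueError("field count is not a multiple of the column count")
    rows = [values[i:i + ncol] for i in range(0, len(values), ncol)]
    return header, rows


sample = "\n".join([
    "a ~ b ~ c ~ d ~ e",
    "1 ~ name1 ~ This is a paragraph.",
    "",
    "This is a second paragraph.",
    "",
    "~ num1 ~ num2 ~",
    "",
    "2 ~ name2 ~ This is an new set of paragraph.",
    "",
    "~ num1 ~ num2 ~",
])

header, rows = reshape_tilde_text(sample)
print(header)
for row in rows:
    print(row)
```

Like the R version, this holds the whole file in memory as a single string, which may matter for a very large input.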
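Answer 2 (score: 0)
When I saw that this was a text-processing problem, I figured it would be easier in Python. Apologies if you're not familiar with it or don't have access to it: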
import csv

all_rows = []
with open('tilded_csv.txt') as in_file:
    header_line = next(in_file)
    header = [field.strip() for field in header_line.strip().split('~')]
    current_record = ''
    for line in in_file:
        # Assume that a number at the start of a line
        # signals a new record
        if line[0].isdigit():
            # Store the record collected so far, then start a new one
            if current_record:
                all_rows.append(current_record.split('~'))
            current_record = line.strip()
        else:
            # Continuation line: append it to the record being built
            current_record += ' ' + line.strip()
    # Add the last record
    if current_record:
        all_rows.append(current_record.split('~'))

# newline='' lets the csv module control the line endings (Python 3)
with open('standard_csv.csv', 'w', newline='') as out_file:
    out_csv = csv.writer(out_file, dialect='excel')
    out_csv.writerow(header)
    for row in all_rows:
        out_csv.writerow(row)
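The script above treats a leading digit as the start of a record, which would misfire on a paragraph line that happens to begin with a number. If the data really does close every record with a ~ and never uses ~ inside a field (assumptions taken from the sample data, not verified against the real file), one possible alternative is to count delimiters instead. The generator below is only a sketch; iter_records and n_fields are hypothetical names.

```python
def iter_records(lines, n_fields, sep="~"):
    """Yield one list of fields per logical record.

    Hypothetical helper. Assumes `sep` never occurs inside a field and
    that each record is closed by a trailing `sep`, so a complete
    record contains exactly `n_fields` separators.
    """
    buffer = []  # physical lines of the record being assembled
    seen = 0     # separators seen so far in that record
    for line in lines:
        text = line.strip()
        if not text:
            continue  # skip blank lines between paragraphs
        buffer.append(text)
        seen += text.count(sep)
        if seen > n_fields:
            raise ValueError("too many separators in record: %r" % buffer)
        if seen == n_fields:
            fields = [f.strip() for f in " ".join(buffer).split(sep)]
            yield fields[:n_fields]  # drop the empty field after the last sep
            buffer, seen = [], 0


sample = [
    "a ~ b ~ c ~ d ~ e\n",
    "1 ~ name1 ~ This is a paragraph.\n",
    "This is a second paragraph.\n",
    "~ num1 ~ num2 ~\n",
    "2 ~ name2 ~ This is an new set of paragraph.\n",
    "~ num1 ~ num2 ~\n",
]

header = [field.strip() for field in sample[0].split("~")]
records = list(iter_records(sample[1:], n_fields=len(header)))
for record in records:
    print(record)
```

Because it only keeps one record in memory at a time, the same generator can be fed an open file object directly, which may be useful for a huge input.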