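I am new to R and am trying to learn basic data I/O and preprocessing. I have a text file in the format shown below. It is a non-standard format (unlike CSV, JSON, etc.), and I need to convert this structure into a table-like format (more precisely, a data frame like the one we get from reading a CSV file).
Input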
product/productId: B000H13270
review/userId: A3J6I70Z9Q0HRX
review/profileName: Lindey H. Magee
review/helpfulness: 1/3
review/score: 5.0
review/time: 1261785600
review/summary: it's fabulous, but *not* from amazon!
review/text: the price on this product certainly raises my attention on compairing amazon price with the local stores. i can get a can of this rotel at my local kroger for $1. dissapointing!
product/productId: B000H13270
review/userId: A1YLOZQKBX3J1S
review/profileName: R. Lee Dailey "Lee_Dailey"
review/helpfulness: 1/4
review/score: 3.0
review/time: 1221177600
review/summary: too expensive
review/text: howdy y'all,<br /><br />the actual product is VERY good - i'd rate the item a 4 on it's own. however, it's only ONE dollar at the local grocery and - @ twenty eight+ dollars per twelve pack - these are running almost two and a half dollars each.<br /><br />as i said, TOO EXPENSIVE. [*sigh ...*] i was really hoping to get them at something approaching the local cost.<br /><br />take care,<br />lee
Output
product/productId | review/userId ......... | review/text
B000H13270 | A3J6I70Z9Q0HRX | the price on this .... dissapointing!
B000H13270 | A1YLOZQKBX3J1S | howdy y'all,<br /> ..... lee
In Python, I could do the same thing as follows:
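from collections import defaultdict

with open('filename') as f:
    dataFile = f.read().split('\n')  # one "key: value" entry per line
revDict = defaultdict(list)  # field name -> list of values, one per review
for item in dataFile:
    if ': ' not in item:
        continue  # skip blank or malformed lines
    key, value = item.split(': ', 1)  # split at the first separator only
    revDict[key].append(value)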
How can I achieve something similar in R?
Answer 0 (score: 1)
There are many ways to do this. Here is how I did it with readLines, tidyr, and dplyr:
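library(dplyr)
library(tidyr)

# Read the raw "key: value" lines
z <- readLines("mytxt.txt")

z <- data.frame(z = z, stringsAsFactors = FALSE) %>%
  # split at the first ": " only, so values that contain ": " stay intact
  separate(z, into = c("datatype", "val"), sep = ": ", extra = "merge") %>%
  # every "product/productId" line starts a new record
  mutate(rep = cumsum(datatype == "product/productId")) %>%
  na.omit() %>%  # drop blank lines, which have no value
  spread(datatype, val)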
You will get the output as a data frame, like this:
rep product/productId review/helpfulness review/profileName review/score
1 1 B000H13270 1/3 Lindey H. Magee 5.0
2 2 B000H13270 1/4 R. Lee Dailey "Lee_Dailey" 3.0
review/summary
1 it's fabulous, but *not* from amazon!
2 too expensive
review/text
1 the price on this product certainly raises my attention on compairing amazon price with the local stores. i can get a can of this rotel at my local kroger for $1. dissapointing!
2 howdy y'all,<br /><br />the actual product is VERY good - i'd rate the item a 4 on it's own. however, it's only ONE dollar at the local grocery and - @ twenty eight+ dollars per twelve pack - these are running almost two and a half dollars each.<br /><br />as i said, TOO EXPENSIVE. [*sigh ...*] i was really hoping to get them at something approaching the local cost.<br /><br />take care,<br />lee
review/time review/userId
1 1261785600 A3J6I70Z9Q0HRX
2 1221177600 A1YLOZQKBX3J1S
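In tidyr 1.0.0 and later, `spread()` is superseded by `pivot_wider()`. The following is a minimal sketch of the same pipeline with the newer function, not part of the original answer. The `lines` vector is a hypothetical three-field sample standing in for `readLines("mytxt.txt")`, and the sketch assumes every record starts with a `product/productId` line.

```r
library(dplyr)
library(tidyr)

# Hypothetical sample; in practice: lines <- readLines("mytxt.txt")
lines <- c(
  "product/productId: B000H13270",
  "review/score: 5.0",
  "review/summary: good: really",
  "product/productId: B000H13270",
  "review/score: 3.0",
  "review/summary: too expensive"
)

wide <- data.frame(z = lines, stringsAsFactors = FALSE) %>%
  # split at the first ": " only; later ": " stay inside the value
  separate(z, into = c("datatype", "val"), sep = ": ", extra = "merge") %>%
  # a "product/productId" line opens a new record
  mutate(rep = cumsum(datatype == "product/productId")) %>%
  # one row per record, one column per field name
  pivot_wider(names_from = datatype, values_from = val)

print(wide)
```

A field that is absent from a record should simply come out as `NA` in that row, so the records do not all need the same number of lines.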
Answer 1 (score: 1)
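Here is a quick-and-dirty approach: split each line on the colon (every colon other than the first one on each line was removed from the file beforehand), then reshape the data from long to wide format. It gives: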
id V2.product/productId V2.review/userId V2.review/profileName V2.review/helpfulness V2.review/score V2.review/time V2.review/summary V2.review/text
1 1 B000H13270 A3J6I70Z9Q0HRX Lindey H. Magee 1/3 5.0 1261785600 it's fabulous, but *not* from amazon! the price on this product certainly raises my attention on compairing amazon price with the local stores. i can get a can of this rotel at my local kroger for $1. dissapointing!
9 2 B000H13270 A1YLOZQKBX3J1S R. Lee Dailey \\"Lee_Dailey\\" 1/4 3.0 1221177600 too expensive howdy y'all,<br /><br />the actual product is VERY good - i'd rate the item a 4 on it's own. however, it's only ONE dollar at the local grocery and - @ twenty eight+ dollars per twelve pack - these are running almost two and a half dollars each.<br /><br />as i said, TOO EXPENSIVE. [*sigh ...*] i was really hoping to get them at something approaching the local cost.<br /><br />take care,<br />lee
This assumes that each case consists of 8 lines.
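The approach described in this answer can be sketched in base R roughly as follows. This is an illustrative sketch under stated assumptions, not the answerer's own code: the `lines` sample, the `fields_per_record` variable, and the regular expressions are all hypothetical. It splits at the first ": " with `sub()` rather than removing the extra colons from the file, and uses 3 fields per record instead of 8 to keep the sample short.

```r
# Hypothetical sample; in practice: lines <- readLines("mytxt.txt")
lines <- c(
  "product/productId: B000H13270",
  "review/score: 5.0",
  "review/summary: note: colons in the value are kept",
  "product/productId: B000H13270",
  "review/score: 3.0",
  "review/summary: too expensive"
)
fields_per_record <- 3  # assumption: every record has exactly this many lines

# Long format: one row per line, split at the first ": " only
long <- data.frame(
  V1 = sub(": .*$", "", lines),     # field name (text before the first ": ")
  V2 = sub("^[^:]*: ", "", lines),  # value (text after the first ": ")
  stringsAsFactors = FALSE
)

# Number the records: lines 1-3 belong to record 1, lines 4-6 to record 2, ...
long$id <- rep(seq_len(nrow(long) / fields_per_record), each = fields_per_record)

# Long to wide: one row per id, one column per field name
wide <- reshape(long, idvar = "id", timevar = "V1", direction = "wide")

# reshape() prefixes the new columns with "V2."; strip that prefix
names(wide) <- sub("^V2\\.", "", names(wide))
print(wide)
```

Because the record number is derived purely from line position, a record with a missing or extra line would shift every record after it, which is why the fixed-length assumption matters here.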
Answer 2 (score: 1)
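Here is a "poor man's" approach.
I assume that all data chunks have the same fields, that no fields are missing, and that : is used only as the separator.
You have 8 fields; in this example I use 3 and simplify their names.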
fields <- 3
# you can use file = "example.txt" instead of text = ...
data <- read.table(text="
prod: foo 1
rev1: bar 11
rev2: bar 12
prod: foo 2
rev1: bar 21
rev2: bar 22
", sep=":", strip.white=TRUE, stringsAsFactors=FALSE)
rows <- dim(data)[1]/fields
mdata <- matrix(data$V2, nrow=rows, ncol=fields, byrow=TRUE)
colnames(mdata) <- data$V1[1:fields]
as.data.frame(mdata)
Result:
prod rev1 rev2
1 foo 1 bar 11 bar 12
2 foo 2 bar 21 bar 22
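If the assumptions above do not hold (some chunks lack a field, or values contain colons), a variant along the following lines might help. It is only a sketch: the sample data, the variable names, and the rule that a `prod` line always opens a new chunk are assumptions made for illustration.

```r
# Hypothetical sample: the second chunk has no "rev1" line
lines <- c(
  "prod: foo 1",
  "rev1: bar 11",
  "rev2: bar 12",
  "prod: foo 2",
  "rev2: bar 22"
)

key <- sub(": .*$", "", lines)     # text before the first ": "
val <- sub("^[^:]*: ", "", lines)  # text after the first ": "
rec <- cumsum(key == "prod")       # assumption: a "prod" line opens a new chunk
fields <- unique(key)              # column names, in order of first appearance

# Start from an all-NA matrix, then fill each (chunk, field) cell that exists
m <- matrix(NA_character_, nrow = max(rec), ncol = length(fields),
            dimnames = list(NULL, fields))
m[cbind(rec, match(key, fields))] <- val

df <- as.data.frame(m, stringsAsFactors = FALSE)
print(df)
```

Missing fields stay `NA` instead of shifting later values into the wrong column, at the cost of relying on one field being present in every chunk.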