Converting a structured but non-standard text file into an R data frame

Time: 2015-09-24 02:22:52

Tags: r

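I am new to R and am trying to learn basic data I/O and preprocessing. I have a text file in the format shown below. It is a non-standard format (unlike CSV, JSON, etc.): each record is a block of "key: value" lines, and records are separated by a blank line. I need to convert this structure into a table, more precisely the kind of data frame we get from reading a CSV file.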

Input

product/productId: B000H13270
review/userId: A3J6I70Z9Q0HRX
review/profileName: Lindey H. Magee
review/helpfulness: 1/3
review/score: 5.0
review/time: 1261785600
review/summary: it's fabulous, but *not* from amazon!
review/text: the price on this product certainly raises my attention on compairing amazon price with the local stores. i can get a can of this rotel at my local kroger for $1. dissapointing!

product/productId: B000H13270
review/userId: A1YLOZQKBX3J1S
review/profileName: R. Lee Dailey "Lee_Dailey"
review/helpfulness: 1/4
review/score: 3.0
review/time: 1221177600
review/summary: too expensive
review/text: howdy y'all,<br /><br />the actual product is VERY good - i'd rate the item a 4 on it's own. however, it's only ONE dollar at the local grocery and - @ twenty eight+ dollars per twelve pack - these are running almost two and a half dollars each.<br /><br />as i said, TOO EXPENSIVE. [*sigh ...*] i was really hoping to get them at something approaching the local cost.<br /><br />take care,<br />lee

Output

product/productId | review/UserId ......... | review/text
B000H13270        |A3J6I70Z9Q0HRX           |  the price on this .... dissapointing!
B000H13270       | A1YLOZQKBX3J1S          |howdy y'all,<br /> ..... lee

In Python I can do the same thing as follows:

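from collections import defaultdict

revDict = defaultdict(list)  # field name -> list of values, one per review
with open('filename') as f:
    for line in f.read().split('\n'):  # obtain each line of the data
        if ': ' not in line:  # skip the blank lines between records
            continue
        key, value = line.split(': ', 1)  # split at the first ': ' only
        revDict[key].append(value)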

How can I achieve something similar in R? Is there an equivalent in R?

3 Answers:

Answer 0: (score: 1)

There are many ways to do this. Here is how I did it using readLines, tidyr and dplyr:

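library(dplyr)
library(tidyr)
z <- readLines("mytxt.txt")  # one element per line of the file
z <- data.frame(z = z, stringsAsFactors = FALSE) %>%
         # split at the first ": " only, so values that contain ": " stay intact
         separate(z, into = c("datatype", "val"), sep = ": ", extra = "merge", fill = "right") %>%
         # every "product/productId" line starts a new record
         mutate(rep = cumsum(datatype == "product/productId")) %>%
         na.omit() %>%          # drop the blank separator lines
         spread(datatype, val)  # long to wide: one column per field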

You will get the output in a data frame, like this:

  rep product/productId review/helpfulness         review/profileName review/score
1   1        B000H13270                1/3            Lindey H. Magee          5.0
2   2        B000H13270                1/4 R. Lee Dailey "Lee_Dailey"          3.0
                         review/summary
1 it's fabulous, but *not* from amazon!
2                         too expensive
                                                                                                                                                                                                                                                                                                                                                                                                      review/text
1                                                                                                                                                                                                                               the price on this product certainly raises my attention on compairing amazon price with the local stores. i can get a can of this rotel at my local kroger for $1. dissapointing!
2 howdy y'all,<br /><br />the actual product is VERY good - i'd rate the item a 4 on it's own. however, it's only ONE dollar at the local grocery and - @ twenty eight+ dollars per twelve pack - these are running almost two and a half dollars each.<br /><br />as i said, TOO EXPENSIVE. [*sigh ...*] i was really hoping to get them at something approaching the local cost.<br /><br />take care,<br />lee
  review/time  review/userId
1  1261785600 A3J6I70Z9Q0HRX
2  1221177600 A1YLOZQKBX3J1S
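
In current tidyr releases spread() is superseded by pivot_wider(). A minimal sketch of the same pipeline with pivot_wider() follows, assuming dplyr and tidyr are installed. The in-memory `lines` vector is made-up sample data standing in for readLines("mytxt.txt"), and only three of the eight fields are used to keep it short.

```r
library(dplyr)
library(tidyr)

# Hypothetical sample standing in for readLines("mytxt.txt")
lines <- c(
  "product/productId: B000H13270",
  "review/score: 5.0",
  "review/summary: it's fabulous, but *not* from amazon!",
  "",
  "product/productId: B000H13270",
  "review/score: 3.0",
  "review/summary: too expensive"
)

wide <- tibble(z = lines) %>%
  filter(z != "") %>%                                   # drop blank separator lines
  separate(z, into = c("datatype", "val"),
           sep = ": ", extra = "merge") %>%             # split at the first ": " only
  mutate(rep = cumsum(datatype == "product/productId")) %>%  # record counter
  pivot_wider(names_from = datatype, values_from = val)      # long to wide

print(wide)
```

Columns whose names contain a slash have to be accessed with backticks or double brackets, for example wide[["review/score"]].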

Answer 1: (score: 1)

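Here is a quick and dirty approach that splits on the colon (with all colons other than the first on each line removed from the file) and then reshapes the data from long to wide: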


[code block missing from this copy]

Which gives:

  id V2.product/productId V2.review/userId           V2.review/profileName  V2.review/helpfulness V2.review/score V2.review/time                      V2.review/summary                                                                                                                                                                                                                                                                                                                                                                                                    V2.review/text
1  1           B000H13270   A3J6I70Z9Q0HRX                 Lindey H. Magee                   1/3             5.0     1261785600  it's fabulous, but *not* from amazon!                                                                                                                                                                                                                                the price on this product certainly raises my attention on compairing amazon price with the local stores. i can get a can of this rotel at my local kroger for $1. dissapointing!
9  2           B000H13270   A1YLOZQKBX3J1S  R. Lee Dailey \\"Lee_Dailey\\"                   1/4             3.0     1221177600                          too expensive  howdy y'all,<br /><br />the actual product is VERY good - i'd rate the item a 4 on it's own. however, it's only ONE dollar at the local grocery and - @ twenty eight+ dollars per twelve pack - these are running almost two and a half dollars each.<br /><br />as i said, TOO EXPENSIVE. [*sigh ...*] i was really hoping to get them at something approaching the local cost.<br /><br />take care,<br />lee

This assumes that each case consists of 8 lines.
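
The answer's own code is not reproduced here, so the following is only a base-R sketch of the approach it describes: split each line at the first colon, number the records in fixed-size groups, and reshape from long to wide. The `lines` vector is made-up sample data standing in for readLines() on the real file, and `n_fields` is set to 3 for brevity (it would be 8 for the file in the question). Unlike the description above, this sketch splits at the first colon with a regular expression instead of stripping the other colons from the file.

```r
# Hypothetical sample; for the real data use lines <- readLines("yourfile.txt")
lines <- c(
  "product/productId: B000H13270",
  "review/score: 5.0",
  "review/summary: it's fabulous, but *not* from amazon!",
  "",
  "product/productId: B000H13270",
  "review/score: 3.0",
  "review/summary: too expensive: really"
)
lines <- lines[nzchar(lines)]  # drop the blank separator lines
n_fields <- 3                  # lines per record (8 in the question's file)

# Split each line at the first colon only
key <- sub(":.*$", "", lines)
val <- trimws(sub("^[^:]*:", "", lines))

long <- data.frame(
  id  = rep(seq_len(length(lines) / n_fields), each = n_fields),
  key = key,
  val = val,
  stringsAsFactors = FALSE
)

# One row per id, one "val.<key>" column per field
wide <- reshape(long, idvar = "id", timevar = "key", direction = "wide")
print(wide)
```

The row names of the result are the row numbers of the first line of each record in `long`, which is why the output above shows rows 1 and 9 for 8-line records.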

Answer 2: (score: 1)

Here is a "poor man's" approach.

I assume that all data chunks have the same fields, that no field is missing, and that ":" is used only as the separator.

You have 8 fields; in the example I use 3 and simplify their names.

fields <- 3

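# you can use file="example.txt" instead of text=...
# quote="" and comment.char="" keep apostrophes, double quotes and "#"
# in real review text from being treated specially
data <- read.table(text="
    prod: foo  1 
    rev1: bar 11
    rev2: bar 12

    prod: foo  2
    rev1: bar 21
    rev2: bar 22
  ", sep=":", strip.white=TRUE, stringsAsFactors=FALSE, quote="", comment.char="")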

rows <- dim(data)[1]/fields

mdata <- matrix(data$V2, nrow=rows, ncol=fields, byrow=TRUE)

colnames(mdata) <- data$V1[1:fields]

as.data.frame(mdata)

Result:

     prod    rev1    rev2
1  foo  1  bar 11  bar 12
2  foo  2  bar 21  bar 22
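
If a record may omit a field, or a value may itself contain a colon, the fixed-width matrix above no longer lines up. The following base-R sketch relaxes both assumptions by starting a new record at every "prod" line and looking fields up by name. It is only an illustration: the sample `lines` vector is made up, and it still assumes that the first field ("prod" here, "product/productId" in the question) is present in every record.

```r
lines <- c(
  "prod: foo 1",
  "rev1: bar 11",
  "rev2: bar 12",
  "",
  "prod: foo 2",
  "rev2: bar: 22"   # rev1 is missing and the value contains a colon
)
lines <- lines[nzchar(trimws(lines))]      # drop blank separator lines

key <- sub(":.*$", "", lines)              # text before the first colon
val <- trimws(sub("^[^:]*:", "", lines))   # text after the first colon
rec <- cumsum(key == "prod")               # a new record starts at each "prod"

fields <- unique(key)
result <- as.data.frame(
  t(sapply(split(seq_along(lines), rec), function(i) {
    setNames(val[i], key[i])[fields]       # NA where a field is absent
  })),
  stringsAsFactors = FALSE
)
names(result) <- fields
print(result)
```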