Question

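I have a list of items stored in a file as a single vector, with each item identified by a single integer. I also have metadata for each item. In this case each item is a book on Amazon.com, and the metadata has various attributes, as listed below. For each book in my list I would like to get its title, group, salesrank and a few other fields. The metadata also contains records for other groups, such as DVD, but I do not need those and would like to skip them. In the metadata file, each item and its attributes begin with an "Id:" line and end with a blank line. I have tried a bunch of tools in R without much success and am hoping someone can help.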

Here is an excerpt of the metadata file containing 2 books (Id: 9 and Id: 10).

Id:   9
ASIN: 1859677800
  title: Making Bread: The Taste of Traditional Home-Baking
  group: Book
  salesrank: 949166
  similar: 0
  categories: 1
   |Books[283155]|Subjects[1000]|Cooking, Food & Wine[6]|Baking[4196]|Bread[4197]
  reviews: total: 0  downloaded: 0  avg rating: 0

Id:   10
ASIN: 0375709363
  title: The Edward Said Reader
  group: Book
  salesrank: 220379
  similar: 5  039474067X  0679730672  0679750541  1400030668  0896086704
  categories: 3
   |Books[283155]|Subjects[1000]|Literature & Fiction[17]|History & Criticism[10204]|Criticism & Theory[10207]|General[10213]
   |Books[283155]|Subjects[1000]|Nonfiction[53]|Politics[11079]|History & Theory[11086]
   |Books[283155]|Subjects[1000]|Nonfiction[53]|Social Sciences[11232]|Anthropology[11233]|Cultural[11235]
  reviews: total: 6  downloaded: 6  avg rating: 4
    2000-10-8  cutomer: A2RI73IFW2GWU1  rating: 4  votes:  12  helpful:   7
    2001-5-4  cutomer: A1GE54WF2WUZ2X  rating: 5  votes:  11  helpful:   8
    2001-8-27  cutomer: A36S399V1VC4DR  rating: 4  votes:   5  helpful:   3
    2002-1-26  cutomer: A280GY5UVUS2QH  rating: 3  votes:  12  helpful:   7
    2004-4-7  cutomer: A2YHZJIU4L4IOI  rating: 4  votes:  10  helpful:   2
    2004-4-27  cutomer: A1MB83EO48TRSC  rating: 4  votes:   5  helpful:   3

Answer 1

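Assuming the posted data is in a text file named myfile.txt, reduce it to the lines you need and then parse those lines into long-form data. Add a grp column that identifies which fields belong to the same Id. Optionally, use dcast from the reshape2 package to reshape the result from long to wide form: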

library(reshape2)

L <- readLines("myfile.txt")

# add other fields to the regular expression as needed
ok <- grep("^Id:|^ *title:|^ *group:", L, value = TRUE)

# create data frame in long form
# strip the label up to the first colon only, so titles containing ": " stay intact
long <- data.frame(lab = gsub("^ *|:.*", "", ok), value = sub("^[^:]*: *", "", ok))
long$grp <- cumsum(long$lab == "Id")

# optionally reshape it into wide form
wide <- dcast(long, grp ~ lab, value.var = "value")

The last line gives:

> wide
  grp group Id                                              title
1   1  Book  9 Making Bread: The Taste of Traditional Home-Baking
2   2  Book 10                             The Edward Said Reader
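
The question also asks for salesrank and wants non-book records skipped. The following is a minimal sketch of one way to extend the approach above, not a definitive implementation. It uses base R's reshape in place of dcast so that it runs without extra packages. The inline sample stands in for readLines("myfile.txt"), and the DVD record in it (Id 11, its ASIN, title and salesrank) is made up purely for illustration.

```r
# Inline sample standing in for readLines("myfile.txt").
# The DVD record (Id 11) is hypothetical and only there to show the filtering.
L <- c(
  "Id:   9",
  "ASIN: 1859677800",
  "  title: Making Bread: The Taste of Traditional Home-Baking",
  "  group: Book",
  "  salesrank: 949166",
  "",
  "Id:   11",
  "ASIN: 0000000000",
  "  title: A Hypothetical DVD",
  "  group: DVD",
  "  salesrank: 12345",
  ""
)

# keep only the fields of interest
ok <- grep("^Id:|^ *(title|group|salesrank):", L, value = TRUE)

# long form: one row per field, label and value split at the first colon
long <- data.frame(
  lab   = sub("^ *([^:]*):.*", "\\1", ok),
  value = sub("^[^:]*: *", "", ok),
  stringsAsFactors = FALSE
)
long$grp <- cumsum(long$lab == "Id")

# long to wide with base R: one row per record, one column per field
wide <- reshape(long, idvar = "grp", timevar = "lab", direction = "wide")
names(wide) <- sub("^value\\.", "", names(wide))

# keep books only (%in% also drops records with a missing group)
books <- wide[wide$group %in% "Book", ]
books$salesrank <- as.numeric(books$salesrank)
```

Filtering after reshaping keeps the parsing step simple. For a very large metadata file it may be worth dropping unwanted records before reshaping instead.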

Answer 2

If you use readLines, you can get this data into R as a character vector with one element per line:

z <- readLines("example-text.txt")

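You can then use this initial read to locate the records, and either read each record separately with scan or split the lines into records. For example:

idpos <- grep("^Id:", z)  # anchor the pattern so only record headers match
# first record: from its Id line up to the line before the next Id line
scan("example-text.txt", skip = idpos[1] - 1, nlines = idpos[2] - idpos[1], what = "character", sep = "\n")
# second record: from its Id line through the last line of the file
scan("example-text.txt", skip = idpos[2] - 1, nlines = length(z) - idpos[2] + 1, what = "character", sep = "\n")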

You can then parse these strings in a variety of ways to convert them into another data structure.
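
As one possible way to do that parsing, here is a minimal sketch that splits the lines into records and pulls a few fields out of each. It is only an illustration under stated assumptions: the inline vector stands in for z <- readLines("example-text.txt"), and get_field is a hypothetical helper introduced here, not something from the answer.

```r
# Inline sample standing in for z <- readLines("example-text.txt")
z <- c(
  "Id:   9",
  "ASIN: 1859677800",
  "  title: Making Bread: The Taste of Traditional Home-Baking",
  "  group: Book",
  "  salesrank: 949166",
  "",
  "Id:   10",
  "ASIN: 0375709363",
  "  title: The Edward Said Reader",
  "  group: Book",
  "  salesrank: 220379",
  ""
)

# each "Id:" line starts a new record; cumsum turns that into a record number
rec_id <- cumsum(grepl("^Id:", z))

# drop any lines before the first record, then split the rest by record
records <- split(z[rec_id > 0], rec_id[rec_id > 0])

# get_field (hypothetical helper): value of the first "name:" line, or NA if absent
get_field <- function(rec, name) {
  hit <- grep(paste0("^ *", name, ":"), rec, value = TRUE)
  if (length(hit) == 0) return(NA_character_)
  sub("^[^:]*: *", "", hit[1])
}

# one row per record, one column per requested field
fields <- c("Id", "title", "group", "salesrank")
parsed <- as.data.frame(
  t(sapply(records, function(rec) sapply(fields, get_field, rec = rec))),
  stringsAsFactors = FALSE
)
```

All columns of parsed are character at this point. Numeric fields such as salesrank can be converted afterwards with as.numeric.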
