Extract the text before "[" in a string

Date: 2018-06-01 05:50:14

Tags: r regex

I have a text file containing product information, as shown below.

Sample file

  Id:   513954
    ASIN: 0789447096
      title: 1,000 Makers of the Millennium: The Men and Women Who Have Shaped the Last 1,000 Years
      group: Book
      salesrank: 831366
      similar: 0
      categories: 1
       |Books[283155]|Subjects[1000]|Children's Books[4]|Ages 9-12[2786]|General[170063]
      reviews: total: 2  downloaded: 2  avg rating: 3.5
        2000-4-3  cutomer: A1PN4N5OC3GET7  rating: 5  votes:  11  helpful:   6
        2000-11-25  cutomer: A18C5AJ277PFVO  rating: 2  votes:  14  helpful:  14

Expected result

ASIN        V1     V2        V3                       V4            V5      V6
0789447096  Books  Subjects  Religion & Spirituality  Christianity  Clergy  Preaching

What I have so far

library(stringr)
line <- "|Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Christianity[12290]|Clergy[12360]|Preaching[12368]"
x <- str_split(line, fixed("|"))
ans <- gsub("[[:punct:]]", " ", x)
ans <- gsub("\\d", "", ans)

[1]  "c      Books      Subjects      Religion   Spirituality      Christianity      Clergy      Preaching    "

There is an extra "c" in the result! Also, this does not work when I try it on the actual file. How can I do this?

products <- readLines("https://raw.githubusercontent.com/pranavn91/PhD/master/Expt/amazon.txt")

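1 Answer:

Answer 0: (score: 1)

I made two small changes to your existing code.

You have to escape the "|" (pipe) symbol, because it means "or" (alternation) in a regular expression. Once it is escaped, it is matched literally. To remove the "[" and "]" you have to do the same thing and escape them as well, since square brackets denote a character class in a regular expression.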
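As a minimal sketch of those two escapes on the single sample line from the question: it uses base R `strsplit()` so that it runs without extra packages (for this line it splits the same way as `str_split(line, "\\|")`), and the variable name `parts` is only illustrative. It also shows where the stray "c" comes from: the split result is a list, and `gsub()` coerces a list with `as.character()`, which produces the text `c("", "Books[283155]", ...)`; taking the first list element with `[[1]]` avoids that.

```r
line <- "|Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Christianity[12290]|Clergy[12360]|Preaching[12368]"

# Split on a literal pipe; the result is a list, so take its first element
# to get a plain character vector before passing it to gsub().
parts <- strsplit(line, "\\|")[[1]]

# Remove the bracketed numeric id that follows each category name.
# Both brackets are escaped so they are matched literally.
parts <- gsub("\\[\\d+\\]", "", parts)

# The line starts with "|", so the first piece is an empty string; drop it.
parts <- parts[parts != ""]

print(parts)
```

Because only the bracketed ids are removed, punctuation inside a category name such as "Religion & Spirituality" is left intact, which the `[[:punct:]]` replacement in the question would not do.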

So, for the whole file, you can do this:

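library(stringr)  # needed for str_split()
options(stringsAsFactors = FALSE)

rd <- readLines("https://raw.githubusercontent.com/pranavn91/PhD/master/Expt/amazon.txt")

# Keep only the ASIN lines and the category lines (those containing "|")
rd1 <- data.frame(var = rd[grepl("\\||ASIN", rd)])

# Copy each ASIN line into its own column, then carry it forward
# onto the category rows that follow it
rd1$asin <- ifelse(grepl("ASIN", rd1$var), rd1$var, NA)
rd1$asin <- zoo::na.locf(rd1$asin)

# Drop the ASIN rows themselves; only category rows remain
rd1 <- rd1[!grepl("ASIN", rd1$var), ]

# Split each category line on a literal "|" and pad every result
# with NA to the length of the longest one
newlyst <- str_split(rd1[, "var"], "\\|")
max_len <- max(lengths(newlyst))
newdf_inp <- lapply(newlyst, `length<-`, max_len)

df <- data.frame(do.call("rbind", newdf_inp))
# head(df)

# Remove the bracketed numeric ids, e.g. "[283155]"
ans <- data.frame(sapply(df, function(x) gsub("\\[\\d+\\]", "", x)))
# head(ans)

# Drop columns that contain only blanks (the piece before the leading "|")
ans1 <- sapply(ans, function(x) all(trimws(x) == ""))
cbind(ASIN = rd1[, "asin"], ans[, !ans1])  ## Final answer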

Output

> head(cbind(ASIN = rd1[,'asin'], ans[,!ans1]),2)
              ASIN    X2       X3               X4                 X5
1 ASIN: 0789447096 Books Subjects Children's Books          Ages 9-12
2 ASIN: 1582450099 Books Subjects    Home & Garden Animal Care & Pets
       X6     X7
1 General       
2    Dogs Breeds

Updated output:

> head(cbind(ASIN = rd1[,'asin'], ans[,!ans1])) ##Final Answer
              ASIN    X2               X3                      X4
1 ASIN: 0827229534 Books         Subjects Religion & Spirituality
2 ASIN: 0827229534 Books         Subjects Religion & Spirituality
3 ASIN: B00001ZSVK Music           Styles                     Pop
4 ASIN: B00001ZSVK Music           Styles                     R&B
5 ASIN: B00001ZSVK Music           Styles    Broadway & Vocalists
6 ASIN: B00001ZSVK Music Specialty Stores             Indie Music
                    X5      X6        X7   X8   X9  X10  X11
1         Christianity  Clergy Preaching <NA> <NA> <NA> <NA>
2         Christianity  Clergy   Sermons <NA> <NA> <NA> <NA>
3              General    <NA>      <NA> <NA> <NA> <NA> <NA>
4              General    <NA>      <NA> <NA> <NA> <NA> <NA>
5             Musicals General      <NA> <NA> <NA> <NA> <NA>
6 Broadway & Vocalists    <NA>      <NA> <NA> <NA> <NA> <NA>
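The ASIN column in the outputs above still carries the "ASIN: " label, while the expected result in the question shows the bare identifier. One possible way to strip it is sketched below, assuming every value has the form "ASIN: <id>", possibly with leading whitespace. The names `asin_raw` and `asin` are hypothetical and stand in for `rd1$asin`.

```r
# Illustrative values in the shape produced by the code above
asin_raw <- c("ASIN: 0789447096", "    ASIN: 1582450099")

# Remove optional leading whitespace, the "ASIN:" label,
# and any whitespace that follows it
asin <- sub("^[[:space:]]*ASIN:[[:space:]]*", "", asin_raw)

print(asin)
```

In the full pipeline the same `sub()` call could be applied to `rd1$asin` before the final `cbind()`.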