I have a text file containing product information, as shown below.
Sample file
Id: 513954
ASIN: 0789447096
title: 1,000 Makers of the Millennium: The Men and Women Who Have Shaped the Last 1,000 Years
group: Book
salesrank: 831366
similar: 0
categories: 1
|Books[283155]|Subjects[1000]|Children's Books[4]|Ages 9-12[2786]|General[170063]
reviews: total: 2 downloaded: 2 avg rating: 3.5
2000-4-3 cutomer: A1PN4N5OC3GET7 rating: 5 votes: 11 helpful: 6
2000-11-25 cutomer: A18C5AJ277PFVO rating: 2 votes: 14 helpful: 14
Expected result
ASIN V1 V2 V3 V4 V5 V6
0789447096 Books Subjects Religion & Spirituality Christianity Clergy Preaching
What I have so far
library(stringr)
line <- "|Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Christianity[12290]|Clergy[12360]|Preaching[12368]"
x <- str_split(line, fixed("|"))
ans <- gsub("[[:punct:]]", " ", x)
ans <- gsub("\\d", "",ans)
[1] "c Books Subjects Religion Spirituality Christianity Clergy Preaching "
There is an extra "c" in the result! Also, this does not work when I try it on the actual file. How can I do this?
products <- readLines("https://raw.githubusercontent.com/pranavn91/PhD/master/Expt/amazon.txt")
Answer 0 (score: 1)
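I made two small changes to your existing code.
You have to escape the "|" (pipe) symbol, because it means "or" in a regular expression. Once it is escaped, it is matched literally. To remove "[" and "]" you have to do the same thing and escape them as well (unescaped, they delimit a character class in a regular expression).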
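Here is a minimal sketch of that escaping on the single line from the question. It assumes base R's `strsplit()` in place of `str_split()` so that it runs without extra packages; both return a list. It also shows where a stray "c" can come from: `gsub()` coerces a list with `as.character()`, which yields one deparsed string beginning with `c(`.

```r
line <- "|Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Christianity[12290]|Clergy[12360]|Preaching[12368]"

# "\\|" is a literal pipe, not the regex "or" operator.
# The result is a list holding one character vector.
x <- strsplit(line, "\\|")

# Coercing the whole list to character gives a single deparsed string
# that starts with c( ... ); stripping punctuation from it leaves the "c".
print(substr(as.character(x), 1, 2))

# Working on the vector inside the list avoids that.
parts <- x[[1]]

# Escaped brackets: remove the literal "[digits]" suffix from each piece.
parts <- gsub("\\[\\d+\\]", "", parts)

# The line starts with "|", so the first piece is empty; drop it.
parts <- parts[parts != ""]
print(parts)
```

Indexing with `x[[1]]` before calling `gsub()` is the design choice that matters here: the substitutions then run on each category name separately instead of on one combined string.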
So you can do this:
library(stringr) # provides str_split()
options(stringsAsFactors=F)
rd <- readLines("https://raw.githubusercontent.com/pranavn91/PhD/master/Expt/amazon.txt")
rd1 <- data.frame(var = rd[grepl("\\||ASIN",rd)])
rd1$asin <- ifelse(grepl("ASIN", rd1$var)==T, rd1$var,NA)
rd1$asin <- zoo::na.locf(rd1$asin)
rd1 <- rd1[grepl("ASIN", rd1$var)==F,]
newlyst <- str_split(rd1[,"var"], "\\|") #updated line
max_len <- max(lengths(newlyst)) #updated line
newdf_inp <- lapply(newlyst, `length<-`, max_len) #updated line
df <- data.frame(do.call('rbind', newdf_inp)) #updated line
#head(df)
ans <- data.frame(sapply(df, function(x)gsub("\\[\\d+\\]", "", x)))
#head(ans)
ans1 <- sapply(ans, function(x) all(trimws(x)==""))
cbind(ASIN = rd1[,'asin'], ans[,!ans1]) ##Final Answer
Output:
> head(cbind(ASIN = rd1[,'asin'], ans[,!ans1]),2)
ASIN X2 X3 X4 X5
1 ASIN: 0789447096 Books Subjects Children's Books Ages 9-12
2 ASIN: 1582450099 Books Subjects Home & Garden Animal Care & Pets
X6 X7
1 General
2 Dogs Breeds
Updated output:
> head(cbind(ASIN = rd1[,'asin'], ans[,!ans1])) ##Final Answer
ASIN X2 X3 X4
1 ASIN: 0827229534 Books Subjects Religion & Spirituality
2 ASIN: 0827229534 Books Subjects Religion & Spirituality
3 ASIN: B00001ZSVK Music Styles Pop
4 ASIN: B00001ZSVK Music Styles R&B
5 ASIN: B00001ZSVK Music Styles Broadway & Vocalists
6 ASIN: B00001ZSVK Music Specialty Stores Indie Music
X5 X6 X7 X8 X9 X10 X11
1 Christianity Clergy Preaching <NA> <NA> <NA> <NA>
2 Christianity Clergy Sermons <NA> <NA> <NA> <NA>
3 General <NA> <NA> <NA> <NA> <NA> <NA>
4 General <NA> <NA> <NA> <NA> <NA> <NA>
5 Musicals General <NA> <NA> <NA> <NA> <NA>
6 Broadway & Vocalists <NA> <NA> <NA> <NA> <NA> <NA>
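For anyone who wants to try the steps without downloading the file, here is a rough base-R sketch of the same idea on a small in-memory sample. It is not the code above. The vector `rd` is a hypothetical stand-in for the `readLines()` result, the "Sermons" line uses a placeholder ID of 0, and the forward fill uses `cumsum()` indexing as a swapped-in alternative to `zoo::na.locf`. It also strips the "ASIN: " prefix, which the code above keeps.

```r
# Hypothetical sample; the real data comes from readLines() on the file.
rd <- c(
  "ASIN: 0789447096",
  "   |Books[283155]|Subjects[1000]|Children's Books[4]|Ages 9-12[2786]|General[170063]",
  "ASIN: 0827229534",
  "   |Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Christianity[12290]|Clergy[12360]|Preaching[12368]",
  "   |Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Christianity[12290]|Clergy[12360]|Sermons[0]"
)

# Keep only ASIN lines and category lines (those containing a literal pipe).
keep <- rd[grepl("\\||^ASIN", rd)]
is_asin <- grepl("^ASIN", keep)

# Forward fill: cumsum(is_asin) numbers each ASIN block, so indexing the
# ASIN values by that number repeats each ASIN over the lines after it.
asin <- sub("^ASIN: *", "", keep[is_asin])[cumsum(is_asin)]
asin <- asin[!is_asin]
cats <- trimws(keep[!is_asin])

# Split on the literal pipe, drop the empty first piece and the "[id]" parts.
pieces <- lapply(strsplit(cats, "|", fixed = TRUE), function(p) {
  gsub("\\[\\d+\\]", "", p[p != ""])
})

# Pad every row with NA to the longest length, then bind into a data frame.
max_len <- max(lengths(pieces))
padded <- lapply(pieces, `length<-`, max_len)
out <- data.frame(ASIN = asin, do.call(rbind, padded), stringsAsFactors = FALSE)
print(out)
```

Each category line becomes one row, so an ASIN with several category paths appears several times, as in the updated output above.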