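I am new to R and am working on a text analysis project. My problem is that I cannot seem to clean/prepare my data for analysis. The code is below, but after I run it on the chat data, the data does not change.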
library(tm)  # provides Corpus, VectorSource, tm_map, stopwords
chat.df <- chat1
dim(chat.df)
# [1] 0 1
myCorpus <- Corpus(VectorSource(chat.df$text))
# base functions such as tolower must be wrapped in content_transformer()
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
myCorpus <- tm_map(myCorpus, removePunctuation)
myCorpus <- tm_map(myCorpus, removeNumbers)
# add two extra stop words, then keep "r" and "big" out of the stop word list
myStopwords <- c(stopwords("english"), "available", "via")
myStopwords <- setdiff(myStopwords, c("r", "big"))
This is the chat data after running these scripts. I have also attached my file. I need help.
To.link.your.wife.s.email.address.to.a.digital.account.please.follow.these.steps.As.a.registered.user.of.Boston.Globe.Subscriber.Services.your.account.has.already.been.linked.To.confirm.your.account.information.is.correct.please.login.to.Subscriber.Services..httpsservicesbostonglobecomregistrationsubDefaultaspx...I.see.that.you.want.to.view.your.vacation.stops.I.have.two.stops.recorded.The.first.is.a.stop.on.April.3.with.a.resume.date.of.April.7.The.second.is.a.stop.date.of.May.9.with.a.resume.date.of.May.11.If.you.would.like.to.make.any.changes.to.these.vacation.stops.we.are.happy.to.help.you.If.you.are.having.trouble.receiving.today.s.paper.we.suggest.a.manual.download.This.can.be.done.by.going.to.the.Store..select.today.s.date.and.then.downloadltbr.gtltbr.gtIf.there.is.anything.else.we.can.assist.you.with.please.let.us.know.In.30.seconds.I.will.need.to.terminate.the.chat.due.to.no.response..Let.me.know.if.you.are.still.there.Ms.Piasecki.we.has.a.late.truck.in.this.morning.and.all.papers.should.be.delivered.until.9am.I.will.alert.the.distribution.center.manager.to.contact.you.regarding.this.issue.so.he.may.better.understand.how.his.local.carriers.can.improve.their.serviceltbr.gtltbr.gt.We.apologize.for.any.inconvenience.this.delay.may.causeltbr.gt.Let..us.know.if.there.is.anything.else.we.can.help.you.with.today..Our.subscriber.management.system.is.currently.down.for.maintenance.and.as.a.result.I.am.unable.to.make.any.changes.to.accounts.Please.contact.us.tomorrow.with.your.request.when.our.system.is.back.up.and.we.will.be.happy.to.assist.you.further.picture.of.a.head.Please.let.us.know.if.there.is.anything.we.can.help.you.withltbr.gt.We.re.sorry.to.hear.that.you.re.having.trouble.with.the.app..We.are.aware.of.the.issue.since.the.IOS.8.update..We.suggest.de.authorizing.from..the.app.and.then.re.authorizing.This.can.be.done.by.going.to.the.Settings.and.then.Swipe.to.de.authorizeltbr.gtltbr.gtOnce.de.authorized.please.authenticate.in.the.app.againltbr.gtltbr.gt..Cli
ck.on.Settings.on.the.iPhone.or.on.the.iPad.click.the.person.icon.in.the.top.right.corner.of.the.Appltbr.gt..Under.Account.click.on..preview.and.then.under.Registered.User.enter.your.BGcom.e.mail.address.and.password..ltbr.gtltbr.gtOnce.re.authorized.please.try.downloading.today.s.paper.again.This.can.be.done.by.going.to.the.Store..select.today.s.date.and.then.download.You.re.all.set.Is.there.anything.else.I.can.assist.you.with..Your.account.will.be.credited.during.your.time.away.If.you.would.prefer.to.donate.any.days.to.Newspapers.in.Education.please.let.us.know.3.Pickwick.Way.3..Go.to.Pressreader.and.it.will.list.the.authorized.devices.7.day.home.delivery.is.1099.a.week.8.o.clock.is.the.standard.delivery.time.on.weekends.102614.is.your.new.paid.to.date
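One possible reading of the `0 1` result from `dim(chat.df)`, offered as an assumption because the post does not show how `chat1` was loaded: the whole chat text may have been read as a column name rather than as data. With `header = TRUE` and the default `check.names = TRUE`, `read.csv` passes column names through `make.names`, which replaces spaces and punctuation with dots, and that would match the dotted text above. The sketch below reproduces the effect on a made-up one-line input (`raw_line` and the sample sentence are hypothetical).

```r
# Hypothetical one-line input standing in for the chat file (assumption).
raw_line <- "To link your wife's email address to a digital account"

# Read with a header: the only line becomes the column name, leaving no rows.
as_header <- read.csv(text = raw_line, header = TRUE)
dim(as_header)      # 0 rows, 1 column
names(as_header)    # spaces and the apostrophe are replaced by dots

# Read without a header: the line is kept as data in column V1.
as_data <- read.csv(text = raw_line, header = FALSE, stringsAsFactors = FALSE)
dim(as_data)        # 1 row, 1 column
as_data$V1
```

If this is what happened, the cleaning steps would be running on an empty `chat.df$text`, which would explain why nothing appears to change.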
Answer 0 (score: 1)
Using base R, you can easily clean up the string:
x <- tolower("Time.on.weekends.102614.is.your.New.paid.to")
# "+" turns each run of digits/punctuation into a single space
gsub("[[:digit:][:punct:]]+", " ", x)
[1] "time on weekends is your new paid to"
# Alternatively, drop the digits first, then replace the punctuation
y <- gsub("[0-9]", "", "time.on.weekends.102614.is.your.new.paid.to")
gsub("[[:punct:]]+", " ", y)
[1] "time on weekends is your new paid to"
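The same substitutions can be wrapped in a small helper and applied to a whole character vector before the corpus is built. This is only a minimal sketch; the function name `clean_text` and the sample input `chats` are hypothetical and not part of the original answer.

```r
# Hypothetical helper (not from the original answer): lowercases text,
# replaces runs of digits and punctuation with one space, trims the ends.
clean_text <- function(x) {
  x <- tolower(x)
  x <- gsub("[[:digit:][:punct:]]+", " ", x)  # each run becomes one space
  x <- gsub("[[:space:]]+", " ", x)           # collapse repeated whitespace
  trimws(x)                                   # drop leading/trailing spaces
}

# Sample strings taken from the chat data in the question.
chats <- c("Time.on.weekends.102614.is.your.New.paid.to",
           "8.o.clock.is.the.standard.delivery.time.")
cleaned <- clean_text(chats)
cleaned
```

The result could then be passed to `VectorSource()` in place of the raw text column.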
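Answer 1 (score: 0)
Why not remove the punctuation before creating the corpus? The stringr approach is preferable to tm::removePunctuation here because it leaves a space where the punctuation was, so adjacent words are not joined together.
You can remove the digits with another call.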
library(stringr)
df <- "o.link.your.wife.s.email.address.to.a.digital.account.please.follow.these.steps.As.a.registered.user.of.Boston.Globe.Subscriber.Services.your.account.has.already.been."
text <- str_replace_all(df, pattern = "[[:punct:]]", " ")
> text
[1] "o link your wife s email address to a digital account please follow these steps As a registered user of Boston Globe Subscriber Services your account has already been "
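The extra call for digits mentioned above might look like the following. It is a sketch under assumptions: the variable names and the shortened sample string are illustrative, and the final `str_squish` step (which collapses repeated whitespace) is an addition the answer does not mention.

```r
library(stringr)

# Sample input shortened from the chat data in the question.
txt <- "delivery.is.1099.a.week.8.o.clock.is.the.standard"

no_punct  <- str_replace_all(txt, "[[:punct:]]", " ")       # punctuation -> space
no_digits <- str_replace_all(no_punct, "[[:digit:]]", " ")  # digits -> space
cleaned   <- str_squish(no_digits)                          # collapse and trim whitespace
cleaned
```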