R: Is it possible to manipulate a corpus created with tm?

Time: 2013-02-26 12:54:21

Tags: r corpus tm

I created a corpus from a data frame called userbios, which has three columns: list_id, twitter_id, and bio.

I then converted the userbios$bio column into a corpus and ran LDA on it.

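However, I realized that I should first have aggregated all the bios by twitter_id: there are only about 2,000 unique twitter IDs but 30,000 list_ids (and 30,000 bios, one per list_id), and I want to treat all the bios belonging to each unique twitter_id as a single document.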

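I could run the code again, aggregating first and then converting the result to a corpus. However, cleaning the data (removing stopwords, URLs, etc.) took me 8 to 9 hours, and whenever I try it on the newly aggregated documents, the application stops responding.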

Is there a way to take the documents that have already been cleaned and aggregate them by twitter_id?

Here is an example:

library(tm)
sample_userbios <- read.csv("sample.csv")
sample_myCorpus <- Corpus(VectorSource(sample_userbios$bio))

sample_myCorpus <- tm_map(sample_myCorpus, tolower)
sample_myCorpus <- tm_map(sample_myCorpus, removePunctuation)
sample_myCorpus <- tm_map(sample_myCorpus, removeNumbers)
sample_myCorpus <- tm_map(sample_myCorpus, removeWords, stopwords('english'))

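So now I have the following: userbios, a data frame with three columns: list_id, twitter_id, and bio. The list_ids are unique but the twitter_ids are not. Each bio belongs to one list_id, but I want to aggregate the bios by twitter_id. I could do it like this: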

desired_corpus_file <- aggregate(bio ~ twitter_id, sample_userbios, paste, collapse = " ")
sample_desired_corpus <- Corpus(VectorSource(desired_corpus_file$bio))

sample_desired_corpus <- tm_map(sample_desired_corpus, tolower)
sample_desired_corpus <- tm_map(sample_desired_corpus, removePunctuation)
sample_desired_corpus <- tm_map(sample_desired_corpus, removeNumbers)
sample_desired_corpus <- tm_map(sample_desired_corpus, removeWords, stopwords('english'))

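The problem is that I have already removed numbers and stopwords from all of these documents, and I don't want to repeat the cleaning steps: they take a long time and my computer can't even handle them (even when I let the script run overnight, it stops responding while removing stopwords). Is there any way to aggregate the sample_myCorpus documents by twitter_id without re-running the cleaning steps?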
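One possible direction, sketched here in base R, is to pull the cleaned text back out of the corpus as a character vector, pair it with the twitter_id column, and aggregate that instead of the raw bios. The sketch assumes the documents in sample_myCorpus are still in the same order as the rows of sample_userbios. The shortened strings in `cleaned_bio` are illustrative stand-ins for the cleaned documents, and the names `cleaned`, `aggregated`, and `aggregated_corpus` are hypothetical.

```r
# Cleaned texts, one per row of the original data frame, in the same order.
# These shortened strings stand in for the documents of sample_myCorpus.
# With a real tm corpus they could probably be extracted with something like
# sapply(sample_myCorpus, as.character); this is an assumption, since the
# exact accessor depends on the tm version.
cleaned_bio <- c(
  "twitter online social networking service",
  "twitter created march jack dorsey",
  "registered users post tweets",
  "reaction conference highly positive",
  "overhauled website feature fly design",
  "due influx inappropriate content",
  "twitter internationally identifiable signature bird logo",
  "users posts topic type hashtags",
  "numerous tools adding content",
  "according quancast twentyseven million people"
)

# The twitter_id column of sample_userbios, row for row.
twitter_id <- c(12345L, 12346L, 12347L, 12345L, 12348L,
                12347L, 12456L, 12457L, 12456L, 12345L)

# The pairing is only valid if both vectors line up one to one.
stopifnot(length(cleaned_bio) == length(twitter_id))

cleaned <- data.frame(twitter_id = twitter_id, bio = cleaned_bio,
                      stringsAsFactors = FALSE)

# Paste together the already cleaned bios that share a twitter_id.
aggregated <- aggregate(bio ~ twitter_id, data = cleaned,
                        FUN = paste, collapse = " ")

# Optionally rebuild a corpus from the aggregated text if tm is installed.
if (requireNamespace("tm", quietly = TRUE)) {
  aggregated_corpus <- tm::VCorpus(tm::VectorSource(aggregated$bio))
}
```

Because the text going into `aggregate` is already cleaned, none of the tm_map steps need to be repeated on the aggregated documents.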

Here are the dput outputs:

SAMPLE_USERBIOS

structure(list(lit_id = c(23L, 34L, 54L, 32L, 12L, 87L, 65L, 
43L, 22L, 10L), twitter_id = c(12345L, 12346L, 12347L, 12345L, 
12348L, 12347L, 12456L, 12457L, 12456L, 12345L), bio = structure(c(7L, 
8L, 10L, 4L, 1L, 3L, 6L, 9L, 5L, 2L), .Label = c(" overhauled its website once more to feature the \"Fly\" design, which the service says is easier for new users to follow and promotes advertising.", 
"According to Quancast, twenty-seven million people in the US used Twitter as of September 3, 2009. Sixty-three percent of Twitter users are under thirty-five years old; sixty percent of Twitter users are Caucasian,", 
"Due to an influx of inappropriate content, it is now rated 17+ in Apple's app store.", 
"Reaction at the conference was highly positive.", "There are numerous tools for adding content, monitoring content and conversations including Twitvid (video sharing)", 
"Twitter has become internationally identifiable by its signature bird logo. The original logo was in use from its launch in March 2006 until September 2010.", 
"Twitter is an online social networking service and microblogging service that enables its users to send and read text-based messages of up to 140 characters, known as \"tweets\".", 
"Twitter was created in March 2006 by Jack Dorsey and by July, the social networking site was launched. The service rapidly gained worldwide popularity, with over 500 million registered users as of 2012, ", 
"Users can group posts together by topic or type by use of hashtags – words or phrases prefixed with a \"#\" sign. Similarly, the \"@\" sign followed by a username is used for mentioning or replying to other users", 
"while registered users can post tweets through the website interface, SMS, or a range of apps for mobile devices."
), class = "factor")), .Names = c("lit_id", "twitter_id", "bio"
), class = "data.frame", row.names = c(NA, -10L))

SAMPLE_MYCORPUS:

structure(list(structure("twitter   online social networking service  microblogging service  enables  users  send  read textbased messages     characters   tweets", Author = character(0), DateTimeStamp = structure(list(
        sec = 46.8675119876862, min = 11L, hour = 12L, mday = 26L, 
        mon = 1L, year = 113L, wday = 2L, yday = 56L, isdst = 0L), .Names = c("sec", 
    "min", "hour", "mday", "mon", "year", "wday", "yday", "isdst"
    ), class = c("POSIXlt", "POSIXt"), tzone = "GMT"), Description = character(0), Heading = character(0), ID = "1", Language = "en", LocalMetaData = list(), Origin = character(0), class = c("PlainTextDocument", 
    "TextDocument", "character")), structure("twitter  created  march   jack dorsey   july  social networking site  launched  service rapidly gained worldwide popularity    million registered users    ", Author = character(0), DateTimeStamp = structure(list(
        sec = 46.8677449226379, min = 11L, hour = 12L, mday = 26L, 
        mon = 1L, year = 113L, wday = 2L, yday = 56L, isdst = 0L), .Names = c("sec", 
    "min", "hour", "mday", "mon", "year", "wday", "yday", "isdst"
    ), class = c("POSIXlt", "POSIXt"), tzone = "GMT"), Description = character(0), Heading = character(0), ID = "2", Language = "en", LocalMetaData = list(), Origin = character(0), class = c("PlainTextDocument", 
    "TextDocument", "character")), structure(" registered users  post tweets   website interface sms   range  apps  mobile devices", Author = character(0), DateTimeStamp = structure(list(
        sec = 46.8678750991821, min = 11L, hour = 12L, mday = 26L, 
        mon = 1L, year = 113L, wday = 2L, yday = 56L, isdst = 0L), .Names = c("sec", 
    "min", "hour", "mday", "mon", "year", "wday", "yday", "isdst"
    ), class = c("POSIXlt", "POSIXt"), tzone = "GMT"), Description = character(0), Heading = character(0), ID = "3", Language = "en", LocalMetaData = list(), Origin = character(0), class = c("PlainTextDocument", 
    "TextDocument", "character")), structure("reaction   conference  highly positive", Author = character(0), DateTimeStamp = structure(list(
        sec = 46.8679978847504, min = 11L, hour = 12L, mday = 26L, 
        mon = 1L, year = 113L, wday = 2L, yday = 56L, isdst = 0L), .Names = c("sec", 
    "min", "hour", "mday", "mon", "year", "wday", "yday", "isdst"
    ), class = c("POSIXlt", "POSIXt"), tzone = "GMT"), Description = character(0), Heading = character(0), ID = "4", Language = "en", LocalMetaData = list(), Origin = character(0), class = c("PlainTextDocument", 
    "TextDocument", "character")), structure(" overhauled  website    feature  fly design   service   easier   users  follow  promotes advertising", Author = character(0), DateTimeStamp = structure(list(
        sec = 46.8681170940399, min = 11L, hour = 12L, mday = 26L, 
        mon = 1L, year = 113L, wday = 2L, yday = 56L, isdst = 0L), .Names = c("sec", 
    "min", "hour", "mday", "mon", "year", "wday", "yday", "isdst"
    ), class = c("POSIXlt", "POSIXt"), tzone = "GMT"), Description = character(0), Heading = character(0), ID = "5", Language = "en", LocalMetaData = list(), Origin = character(0), class = c("PlainTextDocument", 
    "TextDocument", "character")), structure("due   influx  inappropriate content    rated   apples app store", Author = character(0), DateTimeStamp = structure(list(
        sec = 46.8682401180267, min = 11L, hour = 12L, mday = 26L, 
        mon = 1L, year = 113L, wday = 2L, yday = 56L, isdst = 0L), .Names = c("sec", 
    "min", "hour", "mday", "mon", "year", "wday", "yday", "isdst"
    ), class = c("POSIXlt", "POSIXt"), tzone = "GMT"), Description = character(0), Heading = character(0), ID = "6", Language = "en", LocalMetaData = list(), Origin = character(0), class = c("PlainTextDocument", 
    "TextDocument", "character")), structure("twitter   internationally identifiable   signature bird logo  original logo      launch  march   september ", Author = character(0), DateTimeStamp = structure(list(
        sec = 46.8683569431305, min = 11L, hour = 12L, mday = 26L, 
        mon = 1L, year = 113L, wday = 2L, yday = 56L, isdst = 0L), .Names = c("sec", 
    "min", "hour", "mday", "mon", "year", "wday", "yday", "isdst"
    ), class = c("POSIXlt", "POSIXt"), tzone = "GMT"), Description = character(0), Heading = character(0), ID = "7", Language = "en", LocalMetaData = list(), Origin = character(0), class = c("PlainTextDocument", 
    "TextDocument", "character")), structure("users   posts   topic  type    hashtags  words  phrases prefixed    sign similarly   sign followed   username    mentioning  replying   users", Author = character(0), DateTimeStamp = structure(list(
        sec = 46.8684740066528, min = 11L, hour = 12L, mday = 26L, 
        mon = 1L, year = 113L, wday = 2L, yday = 56L, isdst = 0L), .Names = c("sec", 
    "min", "hour", "mday", "mon", "year", "wday", "yday", "isdst"
    ), class = c("POSIXlt", "POSIXt"), tzone = "GMT"), Description = character(0), Heading = character(0), ID = "8", Language = "en", LocalMetaData = list(), Origin = character(0), class = c("PlainTextDocument", 
    "TextDocument", "character")), structure("  numerous tools  adding content monitoring content  conversations including twitvid video sharing", Author = character(0), DateTimeStamp = structure(list(
        sec = 46.8685901165009, min = 11L, hour = 12L, mday = 26L, 
        mon = 1L, year = 113L, wday = 2L, yday = 56L, isdst = 0L), .Names = c("sec", 
    "min", "hour", "mday", "mon", "year", "wday", "yday", "isdst"
    ), class = c("POSIXlt", "POSIXt"), tzone = "GMT"), Description = character(0), Heading = character(0), ID = "9", Language = "en", LocalMetaData = list(), Origin = character(0), class = c("PlainTextDocument", 
    "TextDocument", "character")), structure("according  quancast twentyseven million people     twitter   september   sixtythree percent  twitter users   thirtyfive   sixty percent  twitter users  caucasian", Author = character(0), DateTimeStamp = structure(list(
        sec = 46.8687040805817, min = 11L, hour = 12L, mday = 26L, 
        mon = 1L, year = 113L, wday = 2L, yday = 56L, isdst = 0L), .Names = c("sec", 
    "min", "hour", "mday", "mon", "year", "wday", "yday", "isdst"
    ), class = c("POSIXlt", "POSIXt"), tzone = "GMT"), Description = character(0), Heading = character(0), ID = "10", Language = "en", LocalMetaData = list(), Origin = character(0), class = c("PlainTextDocument", 
    "TextDocument", "character"))), CMetaData = structure(list(NodeID = 0, 
        MetaData = structure(list(create_date = structure(list(sec = 46.8691101074219, 
            min = 11L, hour = 12L, mday = 26L, mon = 1L, year = 113L, 
            wday = 2L, yday = 56L, isdst = 0L), .Names = c("sec", 
        "min", "hour", "mday", "mon", "year", "wday", "yday", "isdst"
        ), class = c("POSIXlt", "POSIXt"), tzone = "GMT"), creator = "demet"), .Names = c("create_date", 
        "creator")), Children = NULL), .Names = c("NodeID", "MetaData", 
    "Children"), class = "MetaDataNode"), DMetaData = structure(list(
        MetaID = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0)), .Names = "MetaID", row.names = c(NA, 
    -10L), class = "data.frame"), class = c("VCorpus", "Corpus", 
    "list"))

DESIRED_CORPUS_FILE

structure(list(twitter_id = c(12345L, 12346L, 12347L, 12348L, 
12456L, 12457L), bio = c("Twitter is an online social networking service and microblogging service that enables its users to send and read text-based messages of up to 140 characters, known as \"tweets\". Reaction at the conference was highly positive. According to Quancast, twenty-seven million people in the US used Twitter as of September 3, 2009. Sixty-three percent of Twitter users are under thirty-five years old; sixty percent of Twitter users are Caucasian,", 
"Twitter was created in March 2006 by Jack Dorsey and by July, the social networking site was launched. The service rapidly gained worldwide popularity, with over 500 million registered users as of 2012, ", 
"while registered users can post tweets through the website interface, SMS, or a range of apps for mobile devices. Due to an influx of inappropriate content, it is now rated 17+ in Apple's app store.", 
" overhauled its website once more to feature the \"Fly\" design, which the service says is easier for new users to follow and promotes advertising.", 
"Twitter has become internationally identifiable by its signature bird logo. The original logo was in use from its launch in March 2006 until September 2010. There are numerous tools for adding content, monitoring content and conversations including Twitvid (video sharing)", 
"Users can group posts together by topic or type by use of hashtags – words or phrases prefixed with a \"#\" sign. Similarly, the \"@\" sign followed by a username is used for mentioning or replying to other users"
)), .Names = c("twitter_id", "bio"), row.names = c(NA, -6L), class = "data.frame")

SAMPLE_DESIRED_CORPUS

structure(list(structure("twitter   online social networking service  microblogging service  enables  users  send  read textbased messages     characters   tweets", Author = character(0), DateTimeStamp = structure(list(
    sec = 46.8675119876862, min = 11L, hour = 12L, mday = 26L, 
    mon = 1L, year = 113L, wday = 2L, yday = 56L, isdst = 0L), .Names = c("sec", 
"min", "hour", "mday", "mon", "year", "wday", "yday", "isdst"
), class = c("POSIXlt", "POSIXt"), tzone = "GMT"), Description = character(0), Heading = character(0), ID = "1", Language = "en", LocalMetaData = list(), Origin = character(0), class = c("PlainTextDocument", 
"TextDocument", "character")), structure("twitter  created  march   jack dorsey   july  social networking site  launched  service rapidly gained worldwide popularity    million registered users    ", Author = character(0), DateTimeStamp = structure(list(
    sec = 46.8677449226379, min = 11L, hour = 12L, mday = 26L, 
    mon = 1L, year = 113L, wday = 2L, yday = 56L, isdst = 0L), .Names = c("sec", 
"min", "hour", "mday", "mon", "year", "wday", "yday", "isdst"
), class = c("POSIXlt", "POSIXt"), tzone = "GMT"), Description = character(0), Heading = character(0), ID = "2", Language = "en", LocalMetaData = list(), Origin = character(0), class = c("PlainTextDocument", 
"TextDocument", "character")), structure(" registered users  post tweets   website interface sms   range  apps  mobile devices", Author = character(0), DateTimeStamp = structure(list(
    sec = 46.8678750991821, min = 11L, hour = 12L, mday = 26L, 
    mon = 1L, year = 113L, wday = 2L, yday = 56L, isdst = 0L), .Names = c("sec", 
"min", "hour", "mday", "mon", "year", "wday", "yday", "isdst"
), class = c("POSIXlt", "POSIXt"), tzone = "GMT"), Description = character(0), Heading = character(0), ID = "3", Language = "en", LocalMetaData = list(), Origin = character(0), class = c("PlainTextDocument", 
"TextDocument", "character")), structure("reaction   conference  highly positive", Author = character(0), DateTimeStamp = structure(list(
    sec = 46.8679978847504, min = 11L, hour = 12L, mday = 26L, 
    mon = 1L, year = 113L, wday = 2L, yday = 56L, isdst = 0L), .Names = c("sec", 
"min", "hour", "mday", "mon", "year", "wday", "yday", "isdst"
), class = c("POSIXlt", "POSIXt"), tzone = "GMT"), Description = character(0), Heading = character(0), ID = "4", Language = "en", LocalMetaData = list(), Origin = character(0), class = c("PlainTextDocument", 
"TextDocument", "character")), structure(" overhauled  website    feature  fly design   service   easier   users  follow  promotes advertising", Author = character(0), DateTimeStamp = structure(list(
    sec = 46.8681170940399, min = 11L, hour = 12L, mday = 26L, 
    mon = 1L, year = 113L, wday = 2L, yday = 56L, isdst = 0L), .Names = c("sec", 
"min", "hour", "mday", "mon", "year", "wday", "yday", "isdst"
), class = c("POSIXlt", "POSIXt"), tzone = "GMT"), Description = character(0), Heading = character(0), ID = "5", Language = "en", LocalMetaData = list(), Origin = character(0), class = c("PlainTextDocument", 
"TextDocument", "character")), structure("due   influx  inappropriate content    rated   apples app store", Author = character(0), DateTimeStamp = structure(list(
    sec = 46.8682401180267, min = 11L, hour = 12L, mday = 26L, 
    mon = 1L, year = 113L, wday = 2L, yday = 56L, isdst = 0L), .Names = c("sec", 
"min", "hour", "mday", "mon", "year", "wday", "yday", "isdst"
), class = c("POSIXlt", "POSIXt"), tzone = "GMT"), Description = character(0), Heading = character(0), ID = "6", Language = "en", LocalMetaData = list(), Origin = character(0), class = c("PlainTextDocument", 
"TextDocument", "character")), structure("twitter   internationally identifiable   signature bird logo  original logo      launch  march   september ", Author = character(0), DateTimeStamp = structure(list(
    sec = 46.8683569431305, min = 11L, hour = 12L, mday = 26L, 
    mon = 1L, year = 113L, wday = 2L, yday = 56L, isdst = 0L), .Names = c("sec", 
"min", "hour", "mday", "mon", "year", "wday", "yday", "isdst"
), class = c("POSIXlt", "POSIXt"), tzone = "GMT"), Description = character(0), Heading = character(0), ID = "7", Language = "en", LocalMetaData = list(), Origin = character(0), class = c("PlainTextDocument", 
"TextDocument", "character")), structure("users   posts   topic  type    hashtags  words  phrases prefixed    sign similarly   sign followed   username    mentioning  replying   users", Author = character(0), DateTimeStamp = structure(list(
    sec = 46.8684740066528, min = 11L, hour = 12L, mday = 26L, 
    mon = 1L, year = 113L, wday = 2L, yday = 56L, isdst = 0L), .Names = c("sec", 
"min", "hour", "mday", "mon", "year", "wday", "yday", "isdst"
), class = c("POSIXlt", "POSIXt"), tzone = "GMT"), Description = character(0), Heading = character(0), ID = "8", Language = "en", LocalMetaData = list(), Origin = character(0), class = c("PlainTextDocument", 
"TextDocument", "character")), structure("  numerous tools  adding content monitoring content  conversations including twitvid video sharing", Author = character(0), DateTimeStamp = structure(list(
    sec = 46.8685901165009, min = 11L, hour = 12L, mday = 26L, 
    mon = 1L, year = 113L, wday = 2L, yday = 56L, isdst = 0L), .Names = c("sec", 
"min", "hour", "mday", "mon", "year", "wday", "yday", "isdst"
), class = c("POSIXlt", "POSIXt"), tzone = "GMT"), Description = character(0), Heading = character(0), ID = "9", Language = "en", LocalMetaData = list(), Origin = character(0), class = c("PlainTextDocument", 
"TextDocument", "character")), structure("according  quancast twentyseven million people     twitter   september   sixtythree percent  twitter users   thirtyfive   sixty percent  twitter users  caucasian", Author = character(0), DateTimeStamp = structure(list(
    sec = 46.8687040805817, min = 11L, hour = 12L, mday = 26L, 
    mon = 1L, year = 113L, wday = 2L, yday = 56L, isdst = 0L), .Names = c("sec", 
"min", "hour", "mday", "mon", "year", "wday", "yday", "isdst"
), class = c("POSIXlt", "POSIXt"), tzone = "GMT"), Description = character(0), Heading = character(0), ID = "10", Language = "en", LocalMetaData = list(), Origin = character(0), class = c("PlainTextDocument", 
"TextDocument", "character"))), CMetaData = structure(list(NodeID = 0, 
    MetaData = structure(list(create_date = structure(list(sec = 46.8691101074219, 
        min = 11L, hour = 12L, mday = 26L, mon = 1L, year = 113L, 
        wday = 2L, yday = 56L, isdst = 0L), .Names = c("sec", 
    "min", "hour", "mday", "mon", "year", "wday", "yday", "isdst"
    ), class = c("POSIXlt", "POSIXt"), tzone = "GMT"), creator = "demet"), .Names = c("create_date", 
    "creator")), Children = NULL), .Names = c("NodeID", "MetaData", 
"Children"), class = "MetaDataNode"), DMetaData = structure(list(
    MetaID = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0)), .Names = "MetaID", row.names = c(NA, 
-10L), class = "data.frame"), class = c("VCorpus", "Corpus", 
"list"))
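Since the end goal is LDA, which works on term counts, another option might be to leave the corpus alone and aggregate at the document-term level: summing the count rows that share a twitter_id gives the same term frequencies as concatenating the documents first. The following is a minimal sketch with a toy count matrix. The matrix values, the ids, and the name `dtm_by_user` are made up for illustration, and converting a real DocumentTermMatrix to a dense matrix with `as.matrix()` is an assumption that may not be feasible in memory for a large vocabulary.

```r
# Toy document-term counts: one row per cleaned bio, one column per term.
dtm <- matrix(
  c(1, 0, 2,
    0, 1, 0,
    1, 1, 0,
    0, 0, 3),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("1", "2", "3", "4"),
                  c("twitter", "users", "service"))
)

# Hypothetical twitter_id for each row of dtm, in the same row order.
twitter_id <- c(12345L, 12346L, 12345L, 12347L)

# Sum the term counts of all rows that share a twitter_id.
# The result has one row per unique twitter_id.
dtm_by_user <- rowsum(dtm, group = twitter_id)
```

The row names of `dtm_by_user` are the twitter_ids, so each row can be treated as one per-user document when fitting the topic model.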

0 answers:

No answers