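I created a corpus from a data frame named userbios, which has three columns: list_id, twitter_id, and bio.
I then converted the userbios$bio column into a corpus and ran LDA on it.
However, I realized that I should have aggregated all the bios by twitter_id first: there are only about 2,000 unique Twitter IDs but 30,000 list_ids (and 30,000 bios, one per list_id), and I want to treat the combined bios of each unique twitter_id as a single document.
I could run the code again, aggregating first and then converting the result to a corpus. However, cleaning the data (removing stop words, URLs, etc.) took me 8 to 9 hours, and whenever I try to run the cleaning on the newly aggregated documents, the application stops responding.
Is there a way to take the already-cleaned documents and aggregate them by twitter_id?
Here is an example: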
library(tm)
sample_userbios <- read.csv("sample.csv")
sample_myCorpus <- Corpus(VectorSource(sample_userbios$bio))
sample_myCorpus <- tm_map(sample_myCorpus, tolower)
sample_myCorpus <- tm_map(sample_myCorpus, removePunctuation)
sample_myCorpus <- tm_map(sample_myCorpus, removeNumbers)
sample_myCorpus <- tm_map(sample_myCorpus, removeWords, stopwords('english'))
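Each of these cleaning steps transforms one document at a time. As a rough illustration, the first three can be approximated with base R string functions. This is only a sketch of the idea, not what tm runs internally, and `clean_one` is a hypothetical helper name that does not appear in my actual script.

```r
# Hypothetical helper: approximates tolower, removePunctuation and
# removeNumbers on a character vector, one element at a time.
clean_one <- function(x) {
  x <- tolower(x)
  x <- gsub("[[:punct:]]+", "", x)  # drop punctuation characters
  x <- gsub("[[:digit:]]+", "", x)  # drop digits
  x
}

# Two short strings adapted from the sample bios.
bios <- c("Reaction at the conference was highly positive.",
          "It is now rated 17+ in Apple's app store.")
cleaned <- clean_one(bios)
print(cleaned)
```

Since every element is processed independently, the result for one bio does not depend on which other bios are in the corpus.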
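So now I have the following: userbios, a data frame with three columns: list_id, twitter_id, and bio. The list_ids are unique but the twitter_ids are not; each bio belongs to one list_id, but I want to aggregate the bios by twitter_id. I can do it like this: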
# Concatenate all bios that share a twitter_id into one string per ID
desired_corpus_file <- aggregate(bio ~ twitter_id, sample_userbios, paste, collapse = " ")
# Build a corpus from the aggregated bios, then clean that corpus
sample_desired_corpus <- Corpus(VectorSource(desired_corpus_file$bio))
sample_desired_corpus <- tm_map(sample_desired_corpus, tolower)
sample_desired_corpus <- tm_map(sample_desired_corpus, removePunctuation)
sample_desired_corpus <- tm_map(sample_desired_corpus, removeNumbers)
sample_desired_corpus <- tm_map(sample_desired_corpus, removeWords, stopwords('english'))
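But the problem is that I have already removed numbers and stop words from all of these documents, and I do not want to repeat those cleaning steps: they take a long time and my computer cannot even handle them (even when I let the script run overnight, it stops responding while removing stop words). Is there any way to aggregate these sample_myCorpus documents by twitter_id without re-running the cleaning steps?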
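To make the goal concrete, here is a minimal sketch of the kind of result I am after, written in base R on stand-in data. It assumes the cleaned text can be pulled out of the corpus as a plain character vector in the same order as the rows of the data frame. The names `cleaned_bio`, `cleaned_df`, and `aggregated` are hypothetical, and the shortened strings are placeholders rather than my real cleaned bios.

```r
# Stand-in data: twitter_id values and already-cleaned text, in the same
# row order as the original data frame (an assumption of this sketch).
twitter_id <- c(12345L, 12346L, 12347L, 12345L, 12348L)
cleaned_bio <- c("twitter online social networking service",
                 "twitter created march jack dorsey",
                 "registered users post tweets",
                 "reaction conference highly positive",
                 "overhauled website feature fly design")

# With tm, the cleaned text might be extracted with something like
#   cleaned_bio <- sapply(sample_myCorpus, as.character)
# (an assumption; the exact accessor may depend on the tm version).

cleaned_df <- data.frame(twitter_id = twitter_id, bio = cleaned_bio,
                         stringsAsFactors = FALSE)

# Concatenate the cleaned bios within each twitter_id, separated by a space.
aggregated <- aggregate(bio ~ twitter_id, data = cleaned_df,
                        FUN = paste, collapse = " ")
print(aggregated)
```

The `aggregated$bio` column could then be passed to `Corpus(VectorSource(...))` without any further tm_map calls. Because the cleaning steps act on each document's text on its own, joining already-cleaned bios with a space should give essentially the same text as cleaning the joined bios, apart from whitespace.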
Here is the dput output for each object:
SAMPLE_USERBIOS
structure(list(lit_id = c(23L, 34L, 54L, 32L, 12L, 87L, 65L,
43L, 22L, 10L), twitter_id = c(12345L, 12346L, 12347L, 12345L,
12348L, 12347L, 12456L, 12457L, 12456L, 12345L), bio = structure(c(7L,
8L, 10L, 4L, 1L, 3L, 6L, 9L, 5L, 2L), .Label = c(" overhauled its website once more to feature the \"Fly\" design, which the service says is easier for new users to follow and promotes advertising.",
"According to Quancast, twenty-seven million people in the US used Twitter as of September 3, 2009. Sixty-three percent of Twitter users are under thirty-five years old; sixty percent of Twitter users are Caucasian,",
"Due to an influx of inappropriate content, it is now rated 17+ in Apple's app store.",
"Reaction at the conference was highly positive.", "There are numerous tools for adding content, monitoring content and conversations including Twitvid (video sharing)",
"Twitter has become internationally identifiable by its signature bird logo. The original logo was in use from its launch in March 2006 until September 2010.",
"Twitter is an online social networking service and microblogging service that enables its users to send and read text-based messages of up to 140 characters, known as \"tweets\".",
"Twitter was created in March 2006 by Jack Dorsey and by July, the social networking site was launched. The service rapidly gained worldwide popularity, with over 500 million registered users as of 2012, ",
"Users can group posts together by topic or type by use of hashtags – words or phrases prefixed with a \"#\" sign. Similarly, the \"@\" sign followed by a username is used for mentioning or replying to other users",
"while registered users can post tweets through the website interface, SMS, or a range of apps for mobile devices."
), class = "factor")), .Names = c("lit_id", "twitter_id", "bio"
), class = "data.frame", row.names = c(NA, -10L))
SAMPLE_MYCORPUS
structure(list(structure("twitter online social networking service microblogging service enables users send read textbased messages characters tweets", Author = character(0), DateTimeStamp = structure(list(
sec = 46.8675119876862, min = 11L, hour = 12L, mday = 26L,
mon = 1L, year = 113L, wday = 2L, yday = 56L, isdst = 0L), .Names = c("sec",
"min", "hour", "mday", "mon", "year", "wday", "yday", "isdst"
), class = c("POSIXlt", "POSIXt"), tzone = "GMT"), Description = character(0), Heading = character(0), ID = "1", Language = "en", LocalMetaData = list(), Origin = character(0), class = c("PlainTextDocument",
"TextDocument", "character")), structure("twitter created march jack dorsey july social networking site launched service rapidly gained worldwide popularity million registered users ", Author = character(0), DateTimeStamp = structure(list(
sec = 46.8677449226379, min = 11L, hour = 12L, mday = 26L,
mon = 1L, year = 113L, wday = 2L, yday = 56L, isdst = 0L), .Names = c("sec",
"min", "hour", "mday", "mon", "year", "wday", "yday", "isdst"
), class = c("POSIXlt", "POSIXt"), tzone = "GMT"), Description = character(0), Heading = character(0), ID = "2", Language = "en", LocalMetaData = list(), Origin = character(0), class = c("PlainTextDocument",
"TextDocument", "character")), structure(" registered users post tweets website interface sms range apps mobile devices", Author = character(0), DateTimeStamp = structure(list(
sec = 46.8678750991821, min = 11L, hour = 12L, mday = 26L,
mon = 1L, year = 113L, wday = 2L, yday = 56L, isdst = 0L), .Names = c("sec",
"min", "hour", "mday", "mon", "year", "wday", "yday", "isdst"
), class = c("POSIXlt", "POSIXt"), tzone = "GMT"), Description = character(0), Heading = character(0), ID = "3", Language = "en", LocalMetaData = list(), Origin = character(0), class = c("PlainTextDocument",
"TextDocument", "character")), structure("reaction conference highly positive", Author = character(0), DateTimeStamp = structure(list(
sec = 46.8679978847504, min = 11L, hour = 12L, mday = 26L,
mon = 1L, year = 113L, wday = 2L, yday = 56L, isdst = 0L), .Names = c("sec",
"min", "hour", "mday", "mon", "year", "wday", "yday", "isdst"
), class = c("POSIXlt", "POSIXt"), tzone = "GMT"), Description = character(0), Heading = character(0), ID = "4", Language = "en", LocalMetaData = list(), Origin = character(0), class = c("PlainTextDocument",
"TextDocument", "character")), structure(" overhauled website feature fly design service easier users follow promotes advertising", Author = character(0), DateTimeStamp = structure(list(
sec = 46.8681170940399, min = 11L, hour = 12L, mday = 26L,
mon = 1L, year = 113L, wday = 2L, yday = 56L, isdst = 0L), .Names = c("sec",
"min", "hour", "mday", "mon", "year", "wday", "yday", "isdst"
), class = c("POSIXlt", "POSIXt"), tzone = "GMT"), Description = character(0), Heading = character(0), ID = "5", Language = "en", LocalMetaData = list(), Origin = character(0), class = c("PlainTextDocument",
"TextDocument", "character")), structure("due influx inappropriate content rated apples app store", Author = character(0), DateTimeStamp = structure(list(
sec = 46.8682401180267, min = 11L, hour = 12L, mday = 26L,
mon = 1L, year = 113L, wday = 2L, yday = 56L, isdst = 0L), .Names = c("sec",
"min", "hour", "mday", "mon", "year", "wday", "yday", "isdst"
), class = c("POSIXlt", "POSIXt"), tzone = "GMT"), Description = character(0), Heading = character(0), ID = "6", Language = "en", LocalMetaData = list(), Origin = character(0), class = c("PlainTextDocument",
"TextDocument", "character")), structure("twitter internationally identifiable signature bird logo original logo launch march september ", Author = character(0), DateTimeStamp = structure(list(
sec = 46.8683569431305, min = 11L, hour = 12L, mday = 26L,
mon = 1L, year = 113L, wday = 2L, yday = 56L, isdst = 0L), .Names = c("sec",
"min", "hour", "mday", "mon", "year", "wday", "yday", "isdst"
), class = c("POSIXlt", "POSIXt"), tzone = "GMT"), Description = character(0), Heading = character(0), ID = "7", Language = "en", LocalMetaData = list(), Origin = character(0), class = c("PlainTextDocument",
"TextDocument", "character")), structure("users posts topic type hashtags words phrases prefixed sign similarly sign followed username mentioning replying users", Author = character(0), DateTimeStamp = structure(list(
sec = 46.8684740066528, min = 11L, hour = 12L, mday = 26L,
mon = 1L, year = 113L, wday = 2L, yday = 56L, isdst = 0L), .Names = c("sec",
"min", "hour", "mday", "mon", "year", "wday", "yday", "isdst"
), class = c("POSIXlt", "POSIXt"), tzone = "GMT"), Description = character(0), Heading = character(0), ID = "8", Language = "en", LocalMetaData = list(), Origin = character(0), class = c("PlainTextDocument",
"TextDocument", "character")), structure(" numerous tools adding content monitoring content conversations including twitvid video sharing", Author = character(0), DateTimeStamp = structure(list(
sec = 46.8685901165009, min = 11L, hour = 12L, mday = 26L,
mon = 1L, year = 113L, wday = 2L, yday = 56L, isdst = 0L), .Names = c("sec",
"min", "hour", "mday", "mon", "year", "wday", "yday", "isdst"
), class = c("POSIXlt", "POSIXt"), tzone = "GMT"), Description = character(0), Heading = character(0), ID = "9", Language = "en", LocalMetaData = list(), Origin = character(0), class = c("PlainTextDocument",
"TextDocument", "character")), structure("according quancast twentyseven million people twitter september sixtythree percent twitter users thirtyfive sixty percent twitter users caucasian", Author = character(0), DateTimeStamp = structure(list(
sec = 46.8687040805817, min = 11L, hour = 12L, mday = 26L,
mon = 1L, year = 113L, wday = 2L, yday = 56L, isdst = 0L), .Names = c("sec",
"min", "hour", "mday", "mon", "year", "wday", "yday", "isdst"
), class = c("POSIXlt", "POSIXt"), tzone = "GMT"), Description = character(0), Heading = character(0), ID = "10", Language = "en", LocalMetaData = list(), Origin = character(0), class = c("PlainTextDocument",
"TextDocument", "character"))), CMetaData = structure(list(NodeID = 0,
MetaData = structure(list(create_date = structure(list(sec = 46.8691101074219,
min = 11L, hour = 12L, mday = 26L, mon = 1L, year = 113L,
wday = 2L, yday = 56L, isdst = 0L), .Names = c("sec",
"min", "hour", "mday", "mon", "year", "wday", "yday", "isdst"
), class = c("POSIXlt", "POSIXt"), tzone = "GMT"), creator = "demet"), .Names = c("create_date",
"creator")), Children = NULL), .Names = c("NodeID", "MetaData",
"Children"), class = "MetaDataNode"), DMetaData = structure(list(
MetaID = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0)), .Names = "MetaID", row.names = c(NA,
-10L), class = "data.frame"), class = c("VCorpus", "Corpus",
"list"))
DESIRED_CORPUS_FILE
structure(list(twitter_id = c(12345L, 12346L, 12347L, 12348L,
12456L, 12457L), bio = c("Twitter is an online social networking service and microblogging service that enables its users to send and read text-based messages of up to 140 characters, known as \"tweets\". Reaction at the conference was highly positive. According to Quancast, twenty-seven million people in the US used Twitter as of September 3, 2009. Sixty-three percent of Twitter users are under thirty-five years old; sixty percent of Twitter users are Caucasian,",
"Twitter was created in March 2006 by Jack Dorsey and by July, the social networking site was launched. The service rapidly gained worldwide popularity, with over 500 million registered users as of 2012, ",
"while registered users can post tweets through the website interface, SMS, or a range of apps for mobile devices. Due to an influx of inappropriate content, it is now rated 17+ in Apple's app store.",
" overhauled its website once more to feature the \"Fly\" design, which the service says is easier for new users to follow and promotes advertising.",
"Twitter has become internationally identifiable by its signature bird logo. The original logo was in use from its launch in March 2006 until September 2010. There are numerous tools for adding content, monitoring content and conversations including Twitvid (video sharing)",
"Users can group posts together by topic or type by use of hashtags – words or phrases prefixed with a \"#\" sign. Similarly, the \"@\" sign followed by a username is used for mentioning or replying to other users"
)), .Names = c("twitter_id", "bio"), row.names = c(NA, -6L), class = "data.frame")
SAMPLE_DESIRED_CORPUS
structure(list(structure("twitter online social networking service microblogging service enables users send read textbased messages characters tweets", Author = character(0), DateTimeStamp = structure(list(
sec = 46.8675119876862, min = 11L, hour = 12L, mday = 26L,
mon = 1L, year = 113L, wday = 2L, yday = 56L, isdst = 0L), .Names = c("sec",
"min", "hour", "mday", "mon", "year", "wday", "yday", "isdst"
), class = c("POSIXlt", "POSIXt"), tzone = "GMT"), Description = character(0), Heading = character(0), ID = "1", Language = "en", LocalMetaData = list(), Origin = character(0), class = c("PlainTextDocument",
"TextDocument", "character")), structure("twitter created march jack dorsey july social networking site launched service rapidly gained worldwide popularity million registered users ", Author = character(0), DateTimeStamp = structure(list(
sec = 46.8677449226379, min = 11L, hour = 12L, mday = 26L,
mon = 1L, year = 113L, wday = 2L, yday = 56L, isdst = 0L), .Names = c("sec",
"min", "hour", "mday", "mon", "year", "wday", "yday", "isdst"
), class = c("POSIXlt", "POSIXt"), tzone = "GMT"), Description = character(0), Heading = character(0), ID = "2", Language = "en", LocalMetaData = list(), Origin = character(0), class = c("PlainTextDocument",
"TextDocument", "character")), structure(" registered users post tweets website interface sms range apps mobile devices", Author = character(0), DateTimeStamp = structure(list(
sec = 46.8678750991821, min = 11L, hour = 12L, mday = 26L,
mon = 1L, year = 113L, wday = 2L, yday = 56L, isdst = 0L), .Names = c("sec",
"min", "hour", "mday", "mon", "year", "wday", "yday", "isdst"
), class = c("POSIXlt", "POSIXt"), tzone = "GMT"), Description = character(0), Heading = character(0), ID = "3", Language = "en", LocalMetaData = list(), Origin = character(0), class = c("PlainTextDocument",
"TextDocument", "character")), structure("reaction conference highly positive", Author = character(0), DateTimeStamp = structure(list(
sec = 46.8679978847504, min = 11L, hour = 12L, mday = 26L,
mon = 1L, year = 113L, wday = 2L, yday = 56L, isdst = 0L), .Names = c("sec",
"min", "hour", "mday", "mon", "year", "wday", "yday", "isdst"
), class = c("POSIXlt", "POSIXt"), tzone = "GMT"), Description = character(0), Heading = character(0), ID = "4", Language = "en", LocalMetaData = list(), Origin = character(0), class = c("PlainTextDocument",
"TextDocument", "character")), structure(" overhauled website feature fly design service easier users follow promotes advertising", Author = character(0), DateTimeStamp = structure(list(
sec = 46.8681170940399, min = 11L, hour = 12L, mday = 26L,
mon = 1L, year = 113L, wday = 2L, yday = 56L, isdst = 0L), .Names = c("sec",
"min", "hour", "mday", "mon", "year", "wday", "yday", "isdst"
), class = c("POSIXlt", "POSIXt"), tzone = "GMT"), Description = character(0), Heading = character(0), ID = "5", Language = "en", LocalMetaData = list(), Origin = character(0), class = c("PlainTextDocument",
"TextDocument", "character")), structure("due influx inappropriate content rated apples app store", Author = character(0), DateTimeStamp = structure(list(
sec = 46.8682401180267, min = 11L, hour = 12L, mday = 26L,
mon = 1L, year = 113L, wday = 2L, yday = 56L, isdst = 0L), .Names = c("sec",
"min", "hour", "mday", "mon", "year", "wday", "yday", "isdst"
), class = c("POSIXlt", "POSIXt"), tzone = "GMT"), Description = character(0), Heading = character(0), ID = "6", Language = "en", LocalMetaData = list(), Origin = character(0), class = c("PlainTextDocument",
"TextDocument", "character")), structure("twitter internationally identifiable signature bird logo original logo launch march september ", Author = character(0), DateTimeStamp = structure(list(
sec = 46.8683569431305, min = 11L, hour = 12L, mday = 26L,
mon = 1L, year = 113L, wday = 2L, yday = 56L, isdst = 0L), .Names = c("sec",
"min", "hour", "mday", "mon", "year", "wday", "yday", "isdst"
), class = c("POSIXlt", "POSIXt"), tzone = "GMT"), Description = character(0), Heading = character(0), ID = "7", Language = "en", LocalMetaData = list(), Origin = character(0), class = c("PlainTextDocument",
"TextDocument", "character")), structure("users posts topic type hashtags words phrases prefixed sign similarly sign followed username mentioning replying users", Author = character(0), DateTimeStamp = structure(list(
sec = 46.8684740066528, min = 11L, hour = 12L, mday = 26L,
mon = 1L, year = 113L, wday = 2L, yday = 56L, isdst = 0L), .Names = c("sec",
"min", "hour", "mday", "mon", "year", "wday", "yday", "isdst"
), class = c("POSIXlt", "POSIXt"), tzone = "GMT"), Description = character(0), Heading = character(0), ID = "8", Language = "en", LocalMetaData = list(), Origin = character(0), class = c("PlainTextDocument",
"TextDocument", "character")), structure(" numerous tools adding content monitoring content conversations including twitvid video sharing", Author = character(0), DateTimeStamp = structure(list(
sec = 46.8685901165009, min = 11L, hour = 12L, mday = 26L,
mon = 1L, year = 113L, wday = 2L, yday = 56L, isdst = 0L), .Names = c("sec",
"min", "hour", "mday", "mon", "year", "wday", "yday", "isdst"
), class = c("POSIXlt", "POSIXt"), tzone = "GMT"), Description = character(0), Heading = character(0), ID = "9", Language = "en", LocalMetaData = list(), Origin = character(0), class = c("PlainTextDocument",
"TextDocument", "character")), structure("according quancast twentyseven million people twitter september sixtythree percent twitter users thirtyfive sixty percent twitter users caucasian", Author = character(0), DateTimeStamp = structure(list(
sec = 46.8687040805817, min = 11L, hour = 12L, mday = 26L,
mon = 1L, year = 113L, wday = 2L, yday = 56L, isdst = 0L), .Names = c("sec",
"min", "hour", "mday", "mon", "year", "wday", "yday", "isdst"
), class = c("POSIXlt", "POSIXt"), tzone = "GMT"), Description = character(0), Heading = character(0), ID = "10", Language = "en", LocalMetaData = list(), Origin = character(0), class = c("PlainTextDocument",
"TextDocument", "character"))), CMetaData = structure(list(NodeID = 0,
MetaData = structure(list(create_date = structure(list(sec = 46.8691101074219,
min = 11L, hour = 12L, mday = 26L, mon = 1L, year = 113L,
wday = 2L, yday = 56L, isdst = 0L), .Names = c("sec",
"min", "hour", "mday", "mon", "year", "wday", "yday", "isdst"
), class = c("POSIXlt", "POSIXt"), tzone = "GMT"), creator = "demet"), .Names = c("create_date",
"creator")), Children = NULL), .Names = c("NodeID", "MetaData",
"Children"), class = "MetaDataNode"), DMetaData = structure(list(
MetaID = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0)), .Names = "MetaID", row.names = c(NA,
-10L), class = "data.frame"), class = c("VCorpus", "Corpus",
"list"))