Question

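I have compiled a corpus of tweets sent over the past few months or so, which looks something like this (the actual corpus has many more columns and obviously many more rows, but you get the idea):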

id      when            time        day month   year    handle  what
UK1.1   Sat Feb 20 2016 12:34:02    20  2       2016    dave    Great goal by #lfc
UK1.2   Sat Feb 20 2016 15:12:42    20  2       2016    john    Can't wait for the weekend 
UK1.3   Sat Mar 01 2016 12:09:21    1   3       2016    smith   Generic boring tweet

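What I'd now like to do in R is, using grep for string matching, plot the frequency of certain words/hashtags over time, ideally normalised by the number of tweets from that month/day/hour/whatever. But I have no idea how to do this.

I know how to use grep to create subsets of this data frame, e.g. for all tweets including the #lfc hashtag, but I don't really know where to go from there.

The other issue is that whatever timescale is on my x-axis (hour/day/month etc.) needs to be numeric, and the 'when' column isn't. I've tried concatenating the 'day' and 'month' columns into something like '2.13' for February 13th, but this leads to R treating 2.13 as being 'earlier', so to speak, than 2.7 (February 7th) on mathematical grounds.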

Basically, I'd like to make plots like these, where the frequency of string x is plotted against time.

Thanks!

Answer 1

Here's one way to count tweets by day. I've illustrated it with a simplified fake data set:

library(dplyr)
library(lubridate)

# Fake data
set.seed(485)
dat = data.frame(time = seq(as.POSIXct("2016-01-01"),as.POSIXct("2016-12-31"), length.out=10000), 
                 what = sample(LETTERS, 10000, replace=TRUE))

tweet.summary = dat %>% group_by(day = date(time)) %>%  # To summarise by month: group_by(month = month(time, label=TRUE))
  summarise(total.tweets = n(),
            A.tweets = sum(grepl("A", what)),
            pct.A = A.tweets/total.tweets,
            B.tweets = sum(grepl("B", what)),
            pct.B = B.tweets/total.tweets)            

tweet.summary

          day total.tweets A.tweets      pct.A B.tweets      pct.B
1  2016-01-01           28        3 0.10714286        0 0.00000000
2  2016-01-02           27        0 0.00000000        1 0.03703704
3  2016-01-03           28        4 0.14285714        1 0.03571429
4  2016-01-04           27        2 0.07407407        2 0.07407407
...
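The same group_by/summarise pattern should carry over to real hashtags and to other time units. Below is a minimal sketch, assuming a date-time column called time and a text column called what as in the fake data; the four-row tweets data frame is made up for illustration, and lubridate's floor_date is used to bin by hour. Note that grepl("#lfc", ...) also matches longer tags that merely start with #lfc, so tighten the pattern if that matters.

```r
library(dplyr)
library(lubridate)

# Hypothetical miniature corpus (invented rows, column names as in the fake data)
tweets = data.frame(
  time = as.POSIXct(c("2016-02-20 12:34:02", "2016-02-20 12:50:10",
                      "2016-02-20 15:12:42", "2016-03-01 12:09:21"), tz = "UTC"),
  what = c("Great goal by #lfc", "Come on #LFC",
           "Can't wait for the weekend", "Generic boring tweet"),
  stringsAsFactors = FALSE)

# Count tweets per hour and the share that mention the hashtag (case-insensitive)
hourly = tweets %>%
  group_by(hour = floor_date(time, "hour")) %>%
  summarise(total.tweets = n(),
            lfc.tweets = sum(grepl("#lfc", what, ignore.case = TRUE)),
            pct.lfc = lfc.tweets / total.tweets)

hourly
```

Swapping "hour" for "day", "week" or "month" in floor_date changes the bin width without touching the rest of the pipeline.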

Here's how to plot the data using ggplot2. I've also summarised the data frame on the fly within ggplot, using the dplyr and reshape2 packages:

library(ggplot2)
library(reshape2)
library(scales)

ggplot(dat %>% group_by(Month = month(time, label=TRUE)) %>%
         summarise(A = sum(grepl("A", what))/n(),
                   B = sum(grepl("B", what))/n()) %>%
         melt(id.var="Month"),
       aes(Month, value, colour=variable, group=variable)) +
  geom_line() +
  theme_bw() +
  scale_y_continuous(limits=c(0,0.06), labels=percent_format()) +
  labs(colour="", y="")
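If you'd rather have a true date axis than month labels, a daily version might look like the sketch below. It rebuilds the same kind of fake data (with tz = "UTC" added as an assumption so the day boundaries are unambiguous), groups by as.Date(time), and lets scale_x_date handle the tick labels. The object names daily and p are just for illustration.

```r
library(dplyr)
library(ggplot2)
library(scales)

# Fake data, same shape as above (tz fixed to UTC as an assumption)
set.seed(485)
dat = data.frame(time = seq(as.POSIXct("2016-01-01", tz = "UTC"),
                            as.POSIXct("2016-12-31", tz = "UTC"),
                            length.out = 10000),
                 what = sample(LETTERS, 10000, replace = TRUE))

# Share of tweets per day that contain "A"
daily = dat %>%
  group_by(day = as.Date(time)) %>%
  summarise(pct.A = sum(grepl("A", what)) / n())

# Because 'day' is a Date, ggplot orders the x-axis chronologically
p = ggplot(daily, aes(day, pct.A)) +
  geom_line() +
  theme_bw() +
  scale_x_date(date_breaks = "1 month", date_labels = "%b") +
  scale_y_continuous(labels = percent_format()) +
  labs(x = "", y = "Share of tweets containing 'A'")

p
```

Daily proportions on a sample this small are noisy, so a weekly or monthly grouping may be easier to read.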

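Regarding the date format issue, here's how to get numeric dates: you can turn the day, month and year columns into a date using as.Date, and/or turn the day, month, year and time columns into a date-time column using as.POSIXct. Both have underlying numeric values with a date class attached, so R treats them as dates in plotting functions and elsewhere. Once you've done this conversion, you can run the code above to count tweets by day, month, etc.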

# Fake time data
dat2 = data.frame(day=sample(1:28, 10), month=sample(1:12,10), year=2016, 
                  time = paste0(sample(c(paste0(0,0:9),10:12),10),":",sample(10:50,10)))

# Create date-time format column from existing day/month/year/time columns
dat2$posix.date = with(dat2, as.POSIXct(paste0(year,"-", 
                                         sprintf("%02d",month),"-", 
                                         sprintf("%02d", day)," ", 
                                         time)))

# Create date format column
dat2$date = with(dat2, as.Date(paste0(year,"-", 
                                      sprintf("%02d",month),"-", 
                                      sprintf("%02d", day))))

dat2

   day month year  time          posix.date       date
1   28    10 2016 01:44 2016-10-28 01:44:00 2016-10-28
2   22     6 2016 12:28 2016-06-22 12:28:00 2016-06-22
3    3     4 2016 11:46 2016-04-03 11:46:00 2016-04-03
4   15     8 2016 10:13 2016-08-15 10:13:00 2016-08-15
5    6     2 2016 06:32 2016-02-06 06:32:00 2016-02-06
6    2    12 2016 02:38 2016-12-02 02:38:00 2016-12-02
7    4    11 2016 00:27 2016-11-04 00:27:00 2016-11-04
8   12     3 2016 07:20 2016-03-12 07:20:00 2016-03-12
9   24     5 2016 08:47 2016-05-24 08:47:00 2016-05-24 
10  27     1 2016 04:22 2016-01-27 04:22:00 2016-01-27
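If you only had the question's text columns to work with, something along these lines might do. This is a sketch under the assumption that when always has the form "Sat Feb 20 2016" and time the form "12:34:02"; the vectors below are copied from the question's sample rows. It looks the month up in base R's month.abb rather than relying on the %b format, so it should not depend on the session locale, and it ignores the weekday entirely.

```r
# Sample values taken from the question's 'when' and 'time' columns
when = c("Sat Feb 20 2016", "Sat Feb 20 2016", "Sat Mar 01 2016")
time = c("12:34:02", "15:12:42", "12:09:21")

# Split into weekday / month / day / year (one row per tweet)
parts = do.call(rbind, strsplit(when, " "))

# Rebuild an ISO date string, e.g. "2016-02-20"; month.abb is the built-in
# vector of English month abbreviations
iso = sprintf("%s-%02d-%s", parts[, 4], match(parts[, 2], month.abb), parts[, 3])

# Combine with the time column; tz = "UTC" is an assumption about the data
stamp = as.POSIXct(paste(iso, time), tz = "UTC")

stamp
```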

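You can see that the underlying values of a POSIXct date are numeric (the number of seconds elapsed since midnight on Jan 1, 1970) by running as.numeric(dat2$posix.date). Likewise for a Date object (the number of days elapsed since Jan 1, 1970): as.numeric(dat2$date).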
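To tie this back to the 2.13 versus 2.7 problem in the question, here is a small base-R illustration (the variable names feb07 and feb13 are just for this example). A "month.day" number compares as a decimal, whereas Date and POSIXct values compare and subtract as real points in time.

```r
# The question's month.day encoding sorts incorrectly as a plain number
2.13 < 2.7                               # TRUE, although 13 Feb comes after 7 Feb

# Date objects order correctly and have a day-count underneath
feb07 = as.Date("2016-02-07")
feb13 = as.Date("2016-02-13")
feb07 < feb13                            # TRUE
as.numeric(feb13) - as.numeric(feb07)    # 6 (days)
as.numeric(as.Date("1970-01-02"))        # 1 (one day after the origin)

# POSIXct counts seconds from the same origin
as.numeric(as.POSIXct("1970-01-01 00:01:00", tz = "UTC"))   # 60
```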
