Question

I am trying to scrape the date, title, and review text from IMDB with the following loop:

   library(rvest)
   library(dplyr)
   library(stringr)
   library(tidyverse)

   ID <- 4633694

   data <- lapply(paste0('http://www.imdb.com/title/tt', ID, '/reviews?filter=prolific', 1:20),
                  function(url){
                    url %>% read_html() %>%
                      html_nodes(".review-date,.rating-other-user-rating,.title,.show-more__control") %>%
                      html_text() %>%
                      gsub('[\r\n\t]', '', .)
                  })

This gives me 20 pages of review data in the following format (the same pattern repeats):

   col1
1 10/10
2 If this was..
3 14 December 2018
4 I have to say, and no...
5
6
7 10/10
8 Stan Lee Is Smiling Right Now...
9 17 December 2018
10 A movie worthy of...
11
12
13 10/10
14 the most visually stunning film I've ever seen...
15 20 December 2018
16 There's hardly anything... 
17.
18.

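I would like to know whether there is a way to transpose every 4 rows into separate columns, so that each attribute lines up in the proper column, like this: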

         Date          Rating     Title            Review
1. 14 December 2018    10/10    If this was..    I have to...
2. 17 December 2018    10/10   Stan Lee Is...    A movie worthy...
3. 20 December 2018    10/10  the most visually.. There's hardly anything...

Answer 1

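# x is the data frame holding the scraped lines in col1.
# Keep only rows that contain a word character (drops blank and "." rows),
# join them with ":" and start a new line before each rating such as 10/10.
text_data = gsub('\\b(\\d+/\\d+)\\b','\n\\1',paste(grep('\\w',x$col1,value = TRUE),collapse = ':')) 

# Parse the result as ":"-separated fields, one review per line.
# Note that a title or review that itself contains ":" would be split further.
read.csv(text=text_data,h=F,sep=":",strip.white = T,fill=T,stringsAsFactors = F)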
     V1                                                V2               V3                         V4 V5
1 10/10                                     If this was.. 14 December 2018   I have to say, and no... NA
2 10/10                  Stan Lee Is Smiling Right Now... 17 December 2018       A movie worthy of... NA
3 10/10 the most visually stunning film I've ever seen... 20 December 2018 There's hardly anything... NA

Answer 2

Here is one approach.

Data:

x <- read.csv2(header=TRUE, stringsAsFactors=FALSE, text="
col1
10/10
If this was..
14 December 2018
I have to say, and no...


10/10
Stan Lee Is Smiling Right Now...
17 December 2018
A movie worthy of...


10/10
the most visually stunning film I've ever seen...
20 December 2018
There's hardly anything... 
.
.")

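First, we "find" the row that starts each block, which in this case is a row that looks like a rating such as 10/10. Note that you might want or need to fine-tune this regex to minimize false positives and false negatives.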

ind <- grep("^[0-9]+/[0-9]+", x$col1)
x$col1[ind]
# [1] "10/10" "10/10" "10/10"
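
As a rough illustration of that fine-tuning, the sketch below compares the loose pattern with a stricter one. The sample lines are made up for the example, and the stricter pattern assumes that a rating always fills the whole line and is always out of 10, which may not hold for every page.

```r
# Hypothetical scraped lines (not taken from the real page).
lines <- c("10/10", "If this was..", "14 December 2018",
           "9/11 changed everything for me", "7/10")

# Loose pattern from the answer: any line that starts with digits/digits.
loose <- grep("^[0-9]+/[0-9]+", lines)

# Stricter pattern (assumption): the whole line is a rating from 1 to 10 out of 10.
strict <- grep("^(10|[1-9])/10$", lines)

loose
strict
```

Here the loose pattern also picks up the line that merely begins with "9/11", while the stricter one keeps only the two real ratings.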

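ind now holds the index of the first row of each block.

From here, we extract each block, from its first row up to the start of the next block (ideally minus 1; the call below passes ind[-1] unchanged, so each block also picks up the next block's leading rating), and the last block runs through the end of the frame's column: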

y <- Map(function(a,b) x$col[a:b], ind, c(ind[-1], nrow(x)))
str(y)
# List of 3
#  $ : chr [1:5] "10/10" "If this was.." "14 December 2018" "I have to say, and no..." ...
#  $ : chr [1:5] "10/10" "Stan Lee Is Smiling Right Now..." "17 December 2018" "A movie worthy of..." ...
#  $ : chr [1:6] "10/10" "the most visually stunning film I've ever seen..." "20 December 2018" "There's hardly anything... " ...
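
The text mentions "minus 1", but the Map call passes ind[-1] as the end index, so each of the first two blocks has 5 elements and ends with the next block's "10/10". A minimal sketch of the variant that actually subtracts 1 is shown here; the data frame is rebuilt by hand from the sample data (an assumption made only to keep the example self-contained).

```r
# Sample data rebuilt by hand so the example runs on its own.
x <- data.frame(col1 = c(
  "10/10", "If this was..", "14 December 2018", "I have to say, and no...",
  "10/10", "Stan Lee Is Smiling Right Now...", "17 December 2018",
  "A movie worthy of...",
  "10/10", "the most visually stunning film I've ever seen...",
  "20 December 2018", "There's hardly anything... ", ".", "."
), stringsAsFactors = FALSE)

ind <- grep("^[0-9]+/[0-9]+", x$col1)

# End each block one row before the next block starts;
# the last block runs to the end of the column.
ends <- c(ind[-1] - 1, nrow(x))
y <- Map(function(a, b) x$col1[a:b], ind, ends)

lengths(y)
```

With this change the first two blocks hold exactly four elements each, and only the last block carries the two trailing "." rows.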

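We could try to move on (jump to do.call below), but we would run into problems because our vectors have different lengths. We can fix this easily by setting each one to the length of the longest vector. Here is a trick for doing that: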

z <- lapply(y, `length<-`, max(lengths(y)))
str(z)
# List of 3
#  $ : chr [1:6] "10/10" "If this was.." "14 December 2018" "I have to say, and no..." ...
#  $ : chr [1:6] "10/10" "Stan Lee Is Smiling Right Now..." "17 December 2018" "A movie worthy of..." ...
#  $ : chr [1:6] "10/10" "the most visually stunning film I've ever seen..." "20 December 2018" "There's hardly anything... " ...
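
For readers who have not seen `length<-` called as a function, here is a small standalone sketch of what it does. The vectors are made up for the illustration.

```r
v <- c("a", "b")

# Growing a vector pads it with NA values.
padded <- `length<-`(v, 4)

# Shrinking a vector simply drops the tail.
truncated <- `length<-`(1:5, 2)

padded
truncated
```

This is why every element of z ends up with the same length: the shorter blocks are padded with NA.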

Final step:

setNames(do.call("rbind.data.frame", c(z, stringsAsFactors=FALSE)),
         letters[seq_len(length(z[[1]]))])
#       a                                                 b                c
# 1 10/10                                     If this was.. 14 December 2018
# 2 10/10                  Stan Lee Is Smiling Right Now... 17 December 2018
# 3 10/10 the most visually stunning film I've ever seen... 20 December 2018
#                             d     e    f
# 1    I have to say, and no... 10/10 <NA>
# 2        A movie worthy of... 10/10 <NA>
# 3 There's hardly anything...      .    .
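
To get from there to the four named columns the question asks for, one possible sketch is to take exactly four rows starting at each rating and reorder them. This assumes every review has a rating, title, date, and text in that order; a review without a rating would be missed entirely. The names col1, blocks, m, and reviews are chosen here for illustration only.

```r
# Sample lines rebuilt by hand so the example runs on its own.
col1 <- c(
  "10/10", "If this was..", "14 December 2018", "I have to say, and no...",
  "10/10", "Stan Lee Is Smiling Right Now...", "17 December 2018",
  "A movie worthy of...",
  "10/10", "the most visually stunning film I've ever seen...",
  "20 December 2018", "There's hardly anything... ", ".", "."
)

ind <- grep("^[0-9]+/[0-9]+", col1)

# Take the rating row plus the three rows that follow it.
blocks <- lapply(ind, function(i) col1[i:(i + 3)])

# Stack the blocks into a character matrix, one review per row.
m <- do.call(rbind, blocks)

# Reorder into the layout requested in the question.
reviews <- data.frame(
  Date   = m[, 3],
  Rating = m[, 1],
  Title  = m[, 2],
  Review = trimws(m[, 4]),
  stringsAsFactors = FALSE
)

reviews
```

Fixing the block size at four sidesteps the padding step, at the cost of silently ignoring any extra rows between reviews.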

Transpose every 4 rows into 4 columns
