Question

I have a bunch of .txt files (articles) in a folder, and I use a for loop to read all of them into R:

input_loc <- "C:/Users/User/Desktop/Folder"
files <- dir(input_loc, full.names = TRUE)
text <- c()
for (f in files) {
  text <- c(text, paste(readLines(f), collapse = "\n"))
}

From here, I tokenize by paragraph, which gives me every paragraph in each article:

library(tokenizers)  # assumption: tokenize_paragraphs() comes from the tokenizers package
paragraphs <- tokenize_paragraphs(text)
sapply(paragraphs, length)
paragraphs

Then I unlist the result and convert it to a data frame:

par_unlisted <- unlist(paragraphs)
par_unlisted
par_unlisted_df <- as.data.frame(par_unlisted)

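But by doing this I lose the per-article separation in the paragraph numbering. For example, if the first article has 6 paragraphs, the first paragraph of the second article has [1] in front of it before unlisting, but [7] after unlisting. What I want, once I have the data frame, is a column with the paragraph number and another column called "article" that contains the article number. Thanks in advance.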

EDIT: This is roughly what I get when I print paragraphs:

> paragraphs
[[1]]
[1] "The Miami Dolphins have decided to use their non-exclusive franchise 
tag on wide receiver Jarvis Landry."                                                                                                                                                                                                                                         

[2] "The Dolphins tweeted the announcement Tuesday, the first day teams 
could use their franchise or transition tags. The salary for wide receivers 
getting the franchise tag this offseason is expected to be around $16.2 
million, which will be quite the raise for Landry, who made $894,000 last 
season."    
[[2]]
[1] "Despite months of little-to-no movement on contract negotiations, 
Jarvis Landry has often stated his desire to stay in Miami."                                                                                                                                                                                                                                                                                                  

[2] "The Dolphins used their lone tool to wipe away negotation-driven stress 
-- at least in the immediate future -- and ensure Landry won't be lured away 
from Miami, placing the franchise tag on the receiver on Tuesday, the team 
announced."

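I would like to keep the paragraph number ([n]) as a column in the data frame. When I unlist the paragraphs, they are no longer separated first by article and then by paragraph; I get them all in one running sequence. In the example I just posted, I no longer have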

[[1]]
[1] ...
[2] ...

[[2]]
[1] ...
[2] ...

but instead I get

[1] ...
[2] ...
[3] ...
[4] ...

Answer 1

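Consider iterating over the list of paragraphs and building a list of data frames, each carrying the required article and paragraph numbers, then row-binding all of those data frames together at the end.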

Input data

paragraphs <- list(
     c("The Miami Dolphins have decided to use their non-exclusive franchise tag on wide receiver Jarvis Landry.",   
        "The Dolphins tweeted the announcement Tuesday, the first day teams could use their franchise or transition tags. The salary for wide receivers 
getting the franchise tag this offseason is expected to be around $16.2 million, which will be quite the raise for Landry, who made $894,000 last 
season."),
     c("Despite months of little-to-no movement on contract negotiations, Jarvis Landry has often stated his desire to stay in Miami.",
      "The Dolphins used their lone tool to wipe away negotation-driven stress -- at least in the immediate future -- and ensure Landry won't be lured away 
from Miami, placing the franchise tag on the receiver on Tuesday, the team announced."))

Dataframe Build

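df_list <- lapply(seq_along(paragraphs), function(i)
  # One data frame per article: article index, paragraph index, paragraph text
  setNames(data.frame(i, seq_along(paragraphs[[i]]), paragraphs[[i]],
                      stringsAsFactors = FALSE),  # keep text as character in R < 4.0
           c("article_num", "paragraph_num", "paragraph"))
)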

final_df <- do.call(rbind, df_list)

Output

final_df

#   article_num paragraph_num                                             paragraph
# 1           1             1 The Miami Dolphins have decided to use their non-e...
# 2           1             2 The Dolphins tweeted the announcement Tuesday, the...
# 3           2             1 Despite months of little-to-no movement on contrac...
# 4           2             2 The Dolphins used their lone tool to wipe away neg...
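As an alternative to building one data frame per article, the same two index columns can be derived directly from the per-article paragraph counts. The following is only a sketch using base R; the toy list `paras` and the name `final_df2` are made up for illustration and are not part of the original answer.

```r
# Hypothetical toy input: article 1 has three paragraphs, article 2 has one
paras <- list(
  c("a1 p1", "a1 p2", "a1 p3"),
  c("a2 p1")
)

# Number of paragraphs in each article
n <- lengths(paras)

final_df2 <- data.frame(
  # Repeat each article index once per paragraph in that article
  article_num      = rep(seq_along(paras), times = n),
  # Restart the paragraph counter at 1 for every article
  paragraph_num    = sequence(n),
  # Flatten the text in the same article-then-paragraph order
  paragraph        = unlist(paras),
  stringsAsFactors = FALSE
)

final_df2$article_num    # 1 1 1 2
final_df2$paragraph_num  # 1 2 3 1
```

This avoids the intermediate list of data frames and the `do.call(rbind, ...)` step, which may matter when there are many articles. With the real data, `paras` would simply be the `paragraphs` list from above.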
