How do I extract a specific text from a corpus?

Asked: 2019-10-31 07:27:12

Tags: r corpus

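I have a corpus of 213 documents of varying lengths. My aim is to extract from each document one specific portion of text, the one that deals with "fiscal policy". What complicates this is that the passage I want differs from document to document. The only keyword that regularly appears at its beginning is "fiscal policy" (or "fiscal policies"), and that is all I have to go on.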

Let's take an example:

df <- data.frame(Text = c("Stackoverflow is a great place where very skilled people can give you advice on coding. It is so good that I hope they are going to sort this problem out. This problem is really killing me. As regards fiscal policy, almost all euro area countries have submitted their updated stability programmes. While these programmes generally indicate that governments plan to proceed towards sound budgetary positions, there are also indications that budget targets do not consistently imply sufficient consolidation and that concrete and credible measures have not yet been specified in all programmes. These indications are a cause of concern and entail risks for the future. MORE TEXT", "Stackoverflow is a great place where very skilled people can give you advice on coding. Regarding regards fiscal policies, almost all euro area countries have submitted their updated stability programmes. MORE TEXT", "Stackoverflow is a great place where very skilled people can give you advice on coding. It is so good that I hope they are going to sort this problem out. As regards fiscal policy, almost all euro area countries have submitted their updated stability programmes. While these programmes generally indicate that governments plan to proceed towards sound budgetary positions, there are also indications that budget targets do not consistently imply sufficient consolidation and that concrete and credible measures have not yet been specified in all programmes. These indications are a cause of concern and entail risks for the future. Against the background of current good times, it is essential that sound budgetary positions are reached in countries with fiscal imbalances and that a pro-cyclical loosening is avoided in all member countries. MORE TEXT", "Stackoverflow is a great place where very skilled people can give you advice on coding. It is so good that I hope they are going to sort this problem out. This problem is really killing me. 
Turning to fiscal policy, almost all euro area countries have submitted their updated stability programmes. While these programmes generally indicate that governments plan to proceed towards sound budgetary positions, there are also indications that budget targets do not consistently imply sufficient consolidation and that concrete and credible measures have not yet been specified in all programmes. MORE TEXT"))

cp <- corpus(df, text_field = "Text")

The final aim is to obtain a corpus like this:

df <- data.frame(Text = c("As regards fiscal policy, almost all euro area countries have submitted their updated stability programmes. While these programmes generally indicate that governments plan to proceed towards sound budgetary positions, there are also indications that budget targets do not consistently imply sufficient consolidation and that concrete and credible measures have not yet been specified in all programmes. These indications are a cause of concern and entail risks for the future.", "Regarding regards fiscal policies, almost all euro area countries have submitted their updated stability programmes.", "As regards fiscal policy, almost all euro area countries have submitted their updated stability programmes. While these programmes generally indicate that governments plan to proceed towards sound budgetary positions, there are also indications that budget targets do not consistently imply sufficient consolidation and that concrete and credible measures have not yet been specified in all programmes. These indications are a cause of concern and entail risks for the future. Against the background of current good times, it is essential that sound budgetary positions are reached in countries with fiscal imbalances and that a pro-cyclical loosening is avoided in all member countries.", "Turning to fiscal policy, almost all euro area countries have submitted their updated stability programmes. While these programmes generally indicate that governments plan to proceed towards sound budgetary positions, there are also indications that budget targets do not consistently imply sufficient consolidation and that concrete and credible measures have not yet been specified in all programmes."))

cp <- corpus(df, text_field = "Text")

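Note that I would be happy even if I only got the part I am interested in plus the "MORE TEXT" that I do not want, since I could simply subset it afterwards. I just cannot get that far. So far I have tried corpus_segment and also working on the data frame directly, both without success.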

Can anyone help me?

Thanks a lot!

1 Answer:

Answer 0: (score: 3)

A base R solution that does not need any corpus functions:

trimws(grep("fiscal polic.*", unlist(strsplit(df$Text, "[.]")), ignore.case = TRUE, value = TRUE), "both")
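
The one-liner above returns every sentence that mentions the pattern, flattened across all documents. If the goal is one passage per document instead (from the first matching sentence onward, trailing text included, which the question says would be acceptable), a minimal sketch could look like the following. The helper name `extract_from_keyword` and the toy `texts` vector are made up for illustration and are not part of the original answer; treating "a period followed by whitespace" as a sentence boundary is a simplifying assumption.

```r
# Hypothetical helper (not from the original answer): keep everything from
# the first sentence matching `pattern` to the end of the document.
extract_from_keyword <- function(x, pattern = "fiscal polic") {
  # Assumption: a sentence ends with a period followed by whitespace.
  sentences <- trimws(unlist(strsplit(x, "(?<=[.])\\s+", perl = TRUE)))
  hit <- grep(pattern, sentences, ignore.case = TRUE)
  if (length(hit) == 0) return(NA_character_)  # pattern not found in this document
  paste(sentences[hit[1]:length(sentences)], collapse = " ")
}

# Toy input standing in for df$Text
texts <- c(
  "Intro sentence. As regards fiscal policy, things happened. More followed. MORE TEXT",
  "Nothing relevant here. MORE TEXT"
)

# One result per document; NA where the pattern never occurs
result <- vapply(texts, extract_from_keyword, character(1), USE.NAMES = FALSE)
print(result)
```

With the real data this would be called as `vapply(df$Text, extract_from_keyword, character(1), USE.NAMES = FALSE)`, provided `df$Text` is a character column. Where the passage ends is not detected; the leftover tail still has to be trimmed by hand, as the question anticipates.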

In response to the follow-up question, here is how to find the index and use it to subset the data:

# Return vector of sentences containing pattern: 

trimws(grep("fiscal polic.*", unlist(strsplit(df$Text, "[.]")), ignore.case = TRUE, value = TRUE), "both")

# Store the matched text as a vector: 

matched_text <- trimws(grep("fiscal .*", unlist(strsplit(df$Text, "[.]")), ignore.case = TRUE, value = TRUE), "both")

# Get the row index in the data frame for each matched sentence
# (fixed = TRUE so the sentence is matched literally, not as a regex):

matched_text_idx <- sapply(matched_text, function(x){which(grepl(x, df$Text, fixed = TRUE))})

# If you want to subset the dataframe to contain only the elements which contain pattern: 

df$Text[grepl("fiscal polic", df$Text, ignore.case = TRUE)]
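
The `sapply()` lookup above searches the data frame a second time for each matched sentence. An alternative sketch records the source row while splitting, so no second search is needed. The names `sentences_by_doc`, `all_sentences`, `doc_id` and the toy `texts` vector are illustrative and not part of the original answer; as in the answer, splitting on `"[.]"` drops the periods themselves.

```r
# Toy input standing in for df$Text
texts <- c(
  "Intro. As regards fiscal policy, A happened. MORE TEXT",
  "Intro. Nothing to see. MORE TEXT",
  "Turning to fiscal policies, B happened. MORE TEXT"
)

# Split each document separately so the list position doubles as the row index
sentences_by_doc <- strsplit(texts, "[.]")

# One row per sentence, tagged with the document it came from
all_sentences <- data.frame(
  doc_id   = rep(seq_along(sentences_by_doc), lengths(sentences_by_doc)),
  sentence = trimws(unlist(sentences_by_doc)),
  stringsAsFactors = FALSE
)

# Keep only the sentences that mention the pattern
matched <- all_sentences[grepl("fiscal polic", all_sentences$sentence, ignore.case = TRUE), ]
print(matched)
```

Here `matched$doc_id` gives the row of the original data for every matched sentence, so `texts[unique(matched$doc_id)]` selects the documents that contain the pattern.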

Data:

    df <- data.frame(Text = c("Stackoverflow is a great place where very skilled people can give you advice on coding. It is so good that I hope they are going to sort this problem out. This problem is really killing me. As regards fiscal policy, almost all euro area countries have submitted their updated stability programmes. While these programmes generally indicate that governments plan to proceed towards sound budgetary positions, there are also indications that budget targets do not consistently imply sufficient consolidation and that concrete and credible measures have not yet been specified in all programmes. These indications are a cause of concern and entail risks for the future. MORE TEXT", "Stackoverflow is a great place where very skilled people can give you advice on coding. Regarding regards fiscal policies, almost all euro area countries have submitted their updated stability programmes. MORE TEXT", "Stackoverflow is a great place where very skilled people can give you advice on coding. It is so good that I hope they are going to sort this problem out. As regards fiscal policy, almost all euro area countries have submitted their updated stability programmes. While these programmes generally indicate that governments plan to proceed towards sound budgetary positions, there are also indications that budget targets do not consistently imply sufficient consolidation and that concrete and credible measures have not yet been specified in all programmes. These indications are a cause of concern and entail risks for the future. Against the background of current good times, it is essential that sound budgetary positions are reached in countries with fiscal imbalances and that a pro-cyclical loosening is avoided in all member countries. MORE TEXT", "Stackoverflow is a great place where very skilled people can give you advice on coding. It is so good that I hope they are going to sort this problem out. This problem is really killing me. 
Turning to fiscal policy, almost all euro area countries have submitted their updated stability programmes. While these programmes generally indicate that governments plan to proceed towards sound budgetary positions, there are also indications that budget targets do not consistently imply sufficient consolidation and that concrete and credible measures have not yet been specified in all programmes. MORE TEXT"), stringsAsFactors = FALSE)