Creating a corpus in R from text stored in JSON files

Time: 2016-09-23 18:48:28

Tags: json r corpus

I have several JSON files in which the text is split into date, body, and title fields. For example:

{"date": "December 31, 1990, Monday, Late Edition - Final", "body": "World stock markets begin 1991 facing the threat of a war in the Persian Gulf, recessions or economic slowdowns around the world, and dismal earnings -- the same factors that drove stock markets down sharply in 1990.  Finally, there is the problem of the Soviet Union, the wild card in everyone's analysis. It is a country whose problems could send stock markets around the world reeling if something went seriously awry. With Russia about to implode, that just adds to the risk premium, said Mr. Dhar. LOAD-DATE: December 30, 1990 ", "title": "World Markets;"}
{"date": "December 30, 1992, Sunday, Late Edition - Final", "body": "DATELINE: CHICAGO Gleaming new tractors are becoming more familiar sights on America's farms. Sales and profits at the three leading United States tractor makers -- Deere & Company, the J.I. Case division of Tenneco Inc. and the Ford Motor Company's Ford New Holland division -- are all up, reflecting renewed agricultural prosperity after the near-depression of the early and mid-1980's. But the recovery in the tractor business, now in its third year, is fragile.  Tractor makers hope to install computers that can digest this information, then automatically concentrate the application of costly fertilizer and chemicals on the most productive land. Within the next 15 years, that capability will be commonplace, predicted Mr. Ball. LOAD-DATE: December 30, 1990 ", "title": "All About/Tractors;"}

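I have three different newspapers, each with a separate file per year containing all the text produced from 1989 to 2016. My end goal is to combine all of the text into a single corpus. I did this in Python with the pandas library, and I am wondering whether it can be done in R. Here is the loop I have so far: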

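import json
import pandas as pd

# Python (pandas) version: one DataFrame per newspaper per year.
# Each file holds one JSON object per line, so parse it line by line.
appended_data = []
for i in range(1989, 2017):
    df0 = pd.DataFrame([json.loads(l) for l in open('NYT_%d.json' % i)])
    df1 = pd.DataFrame([json.loads(l) for l in open('USAT_%d.json' % i)])
    df2 = pd.DataFrame([json.loads(l) for l in open('WP_%d.json' % i)])
    appended_data.append(df0)
    appended_data.append(df1)
    appended_data.append(df2)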

3 Answers:

Answer 0 (score: 3)

Use jsonlite::stream_in to read your files, and jsonlite::rbind.pages to combine them.
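A minimal sketch of that approach, assuming each file holds one JSON object per line as in the sample from the question. The helper name `read_ndjson_files` is hypothetical, and the sketch calls `rbind_pages`, which is the name recent jsonlite releases use for `rbind.pages`:

```r
library(jsonlite)

# Hypothetical helper (not part of jsonlite): read several line-delimited
# JSON files and stack them into one data.frame.
read_ndjson_files <- function(paths) {
  pages <- lapply(paths, function(p) {
    # stream_in() expects a connection, so wrap the path in file()
    stream_in(file(p), verbose = FALSE)
  })
  # rbind_pages() is the current name of rbind.pages()
  rbind_pages(pages)
}

# File names follow the pattern from the question, e.g. NYT_1989.json
paths <- sprintf("%s_%d.json",
                 rep(c("NYT", "USAT", "WP"), times = length(1989:2016)),
                 rep(1989:2016, each = 3))
paths <- paths[file.exists(paths)]  # skip any file that is not on disk

corpus_df <- read_ndjson_files(paths)
```

The result should have one row per article with date, body, and title columns. If the tm package is what you use for corpora, the body column could then be passed along with something like `tm::VCorpus(tm::VectorSource(corpus_df$body))`.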

Answer 1 (score: 2)

R has many options for reading JSON files and converting them to a data.frame / data.table.

This example uses jsonlite and data.table:

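library(data.table)
library(jsonlite)
# Each file holds one JSON object per line, so read it with stream_in()
# on a file connection; fromJSON() expects a single JSON document per file.
res <- lapply(1989:2016, function(i) {
  ff <- c('NYT_%d.json', 'USAT_%d.json', 'WP_%d.json')
  list_files_paths <- sprintf(ff, i)  # the three file names for year i
  rbindlist(lapply(list_files_paths, function(f) stream_in(file(f), verbose = FALSE)))
})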

Here res is a list of data.tables, one per year. To combine them all into a single data.table:

  rbindlist(res) 

Answer 2 (score: 0)

Use ndjson::stream_in to read them in faster and flatter than jsonlite::stream_in :-)
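A rough sketch of how that might look, assuming the ndjson and data.table packages are installed and the files sit in the working directory. The helper name `read_ndjson_fast` and the glob patterns are illustrative assumptions, not part of either package:

```r
library(data.table)

# Hypothetical helper: read line-delimited JSON files with the ndjson
# package and stack the results into one data.table.
read_ndjson_fast <- function(paths) {
  # Namespaced call, so it cannot be confused with jsonlite::stream_in().
  # ndjson::stream_in() takes a file path; cls = "dt" asks for a data.table.
  tables <- lapply(paths, function(p) ndjson::stream_in(p, cls = "dt"))
  # fill = TRUE tolerates files whose records have differing fields
  rbindlist(tables, fill = TRUE)
}

# Assumed naming scheme from the question: NYT_1989.json, USAT_1989.json, ...
paths <- Sys.glob(c("NYT_*.json", "USAT_*.json", "WP_*.json"))
corpus_dt <- read_ndjson_fast(paths)
```

Calling the function through its namespace matters when jsonlite is also attached, because both packages export a function named `stream_in`.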