Question

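I'm a junior programmer in the areas this question touches, so it would help if answers avoid assuming I already know a lot.

I'm trying to import the OpenLibrary dataset into a local Postgres database. Once it is imported, I plan to use it as the starting seed for a Ruby on Rails application that will hold information about books.

The OpenLibrary dataset is available here, in a modified JSON format: http://openlibrary.org/dev/docs/jsondump

I only need very basic information for my application, far less than the dump provides. I just want to pull out the book titles, the author names, and the relationships between books and authors.

Below are two typical entries from their dataset, the first for an author and the second for a book (they appear to have one entry per edition of a book). Each entry seems to lead with a primary key and then a type, followed by the actual JSON record.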

/a/OL2A /type/author {"name": "U. Venkatakrishna Rao", "personal_name": "U. Venkatakrishna Rao", "last_modified": {"type": "/type/datetime", "value": "2008-09-10 08:44:01.978456"}, "key": "/a/OL2A", "birth_date": "1904", "type": {"key": "/type/author"}, "id": 99, "revision": 3}

/b/OL345M /type/edition {"publishers": ["Social Sciences Research Project, Dept. of Geography, University of Dacca"], "pagination": "ii, 54 p.", "title": "Land use in Fayadabad area", "lccn": ["sa 65000491"], "subject_place": ["East Pakistan", "Dacca region."], "number_of_pages": 54, "languages": [{"comment": "initial import", "code": "eng", "name": "English", "key": "/l/eng"}], "lc_classifications": ["S471.P162 E23"], "publish_date": "1963", "publish_country": "pk", "key": "/b/OL345M", "authors": [{"birth_date": "1911", "name": "Nafis Ahmad", "key": "/a/OL302A", "personal_name": "Nafis Ahmad"}], "publish_places": ["Dacca, East Pakistan"], "by_statement": "[by] Nafis Ahmad and F. Karim Khan.", "oclc_numbers": ["4671066"], "contributions": ["Khan, Fazle Karim, joint author."], "subjects": ["Land use -- East Pakistan -- Dacca region."]}
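Read that way, each line is a key, a type, and a JSON record, in that order. Here is a minimal Python sketch of splitting one such line, assuming the JSON record always starts at the first `{` on the line. The function name `split_dump_line` is made up for illustration, and the sample line is a shortened copy of the author entry above.

```python
import json

def split_dump_line(line):
    """Split one dump line into (key, type, record).

    Assumes the line looks like '<key> <type> <json>' and that the JSON
    record starts at the first '{'. Returns None if the line does not fit.
    """
    start = line.find('{')
    if start == -1:
        return None
    prefix = line[:start].split()  # whitespace-separated key and type
    if len(prefix) != 2:
        return None
    record = json.loads(line[start:])
    return prefix[0], prefix[1], record

# Shortened sample based on the author entry shown in the question.
sample = '/a/OL2A /type/author {"name": "U. Venkatakrishna Rao", "key": "/a/OL2A", "birth_date": "1904"}'
key, entry_type, record = split_dump_line(sample)
print(key, entry_type, record["name"])
```

Working one line at a time like this keeps memory use flat, which matters for files of this size.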

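The uncompressed dumps are very large: about 2GB for the authors list and 18GB for the book editions list. OpenLibrary doesn't provide any tools for this itself. They offer a simple, unoptimized Python script for reading in sample data (which, unlike the actual dumps, is in pure JSON format), but they estimate that if it were modified for use on the actual data, it would take 2 months (!) to finish loading.

How can I read this into a database? I assume I'll need to write a program to do it. What language should I use, and is there any guidance on how to get it done in a reasonable amount of time? The only scripting language I have experience with is Ruby.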

Answer 1

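Downloading the dump from their site will take the two months. But importing it should only take a few hours.

The fastest way is to use Postgres's COPY command. You can use it on the authors file as it is. The editions file, however, needs to be split into two outputs: one for a books table and one for an author_books table.

This script is in Python 2.6, but you should be able to adapt it to Ruby if need be.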

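#!/usr/bin/env python
import json

fp = open('editions.json')
ab_out = open('/tmp/author_book.dump', 'w')
b_out = open('/tmp/book.dump', 'w')
for line in fp:
  # Each dump line is "<key> <type> <json>"; keep only the JSON part.
  vals = json.loads(line.split('/type/edition', 1)[1])
  # Not every edition has a title or a publish date, so default to ''.
  b_out.write("%s\t%s\t%s\n" % (vals['key'], vals.get('title', ''), vals.get('publish_date', '')))
  # Some editions list no authors at all.
  for author in vals.get('authors', []):
    ab_out.write("%s\t%s\n" % (vals['key'], author['key']))
fp.close()
ab_out.close()
b_out.close()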

Then copy it into Postgres:

COPY book_table FROM '/tmp/book.dump'
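One thing to watch with COPY's default text format: tabs separate columns, newlines separate rows, the backslash is the escape character, and `\N` stands for NULL. A title containing any of those characters would corrupt a row if written out raw. The following is a minimal Python sketch of escaping fields before writing them. The helper names `copy_escape` and `copy_row` and the sample values are made up for illustration.

```python
def copy_escape(value):
    """Escape one field for COPY's default text format; None becomes \\N."""
    if value is None:
        return '\\N'
    # Escape the backslash first so the escapes added below are not doubled.
    return (value.replace('\\', '\\\\')
                 .replace('\t', '\\t')
                 .replace('\n', '\\n')
                 .replace('\r', '\\r'))

def copy_row(fields):
    """Join escaped fields with tabs and end the row with a newline."""
    return '\t'.join(copy_escape(f) for f in fields) + '\n'

# Hypothetical edition row: key, title, and a missing publish date.
row = copy_row(['/b/OL1M', 'A title with a\ttab and a \\ backslash', None])
print(repr(row))
```

Writing `copy_row([...])` instead of a hand-built format string keeps the escaping in one place for both the book and the author/book output files.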

Answer 2

Dunno if TAPS would help you here: http://adam.heroku.com/past/2009/2/11/taps_for_easy_database_transfers/

Answer 3

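Following Scott Bailey's advice, I wrote Ruby scripts that convert the JSON into a format the Postgres COPY command accepts. In case anyone else runs into the same problem, here are the scripts I wrote: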

require 'rubygems'
require 'json'

fp = File.open('./edition.txt', 'r')
ab_out = File.new('./author_book.dump', 'w')
b_out = File.new('./book.dump', 'w')

i = 0
while (line = fp.gets) 
  i += 1
  start = line.index /\{/
  if start
    to_parse = line[start, line.length]
    vals = JSON.parse to_parse

    if vals["key"].nil? || vals["title"].nil?
      next
    end
    title = vals["title"]
    #Some titles contain backslashes and tabs, which we need to escape and remove, respectively
    title.gsub! /\\/, "\\\\\\\\"
    title.gsub! /\t/, " "
    if ((vals["isbn_10"].nil? || vals["isbn_10"].empty?) && (vals["isbn_13"].nil? || vals["isbn_13"].empty?))
      b_out.puts vals["key"] + "\t" + title + "\t" + '\N' + "\n"
    #Only get the first ISBN number
    elsif (!vals["isbn_10"].nil? && !vals["isbn_10"].empty?) 
      b_out.puts vals["key"] + "\t" + title + "\t" + vals["isbn_10"][0] + "\n"
    elsif (!vals["isbn_13"].nil? && !vals["isbn_13"].empty?)
      b_out.puts vals["key"] + "\t" + title + "\t" + vals["isbn_13"][0] + "\n"    
    end
    if vals["authors"]
      for author in vals["authors"]
        if !author["key"].nil?
          ab_out.puts vals["key"] + "\t" + author["key"]
        end
      end
    end
  else
    puts "Error processing line: " + line.to_s
  end
  if i % 100000 == 0
    puts "Processed line " + i.to_s
  end
end

fp.close
ab_out.close
b_out.close

and

require 'rubygems'
require 'json'

fp = File.open('./author.txt', 'r')
a_out = File.new('./author.dump', 'w')

i = 0
while (line = fp.gets) 
  i += 1
  start = line.index /\{/
  if start
    to_parse = line[start, line.length]
    vals = JSON.parse to_parse

    if vals["key"].nil? || vals["name"].nil?
      next
    end
    name = vals["name"]
    name.gsub! /\\/, "\\\\\\\\"
    name.gsub! /\t/, " "
    a_out.puts vals["key"] + "\t" + name + "\n"
  else
    puts "Error processing line: " + line.to_s
  end
  if i % 100000 == 0
    puts "Processed line " + i.to_s
  end
end

fp.close
a_out.close
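The ISBN branching in the edition script above boils down to "take the first ISBN-10 if there is one, otherwise the first ISBN-13, otherwise NULL". For readers following the Python answer, here is a minimal sketch of the same rule. The function name `first_isbn` is made up for illustration, and the ISBN values are placeholders rather than real records from the dump.

```python
def first_isbn(vals):
    """Return the first ISBN-10 if present, else the first ISBN-13, else None."""
    for field in ('isbn_10', 'isbn_13'):
        candidates = vals.get(field)
        if candidates:  # skips both a missing field and an empty list
            return candidates[0]
    return None

# Placeholder records; only the fields relevant to the rule are included.
print(first_isbn({'isbn_10': ['0000000001'], 'isbn_13': ['9780000000002']}))
print(first_isbn({'isbn_10': [], 'isbn_13': ['9780000000002']}))
print(first_isbn({'title': 'No ISBN at all'}))
```

A `None` result corresponds to the `\N` that the Ruby script writes for editions without any ISBN.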

Importing a large dataset into a database
