How to organize a SQL database to store book text-analysis data

Asked: 2015-12-28 02:14:18

Tags: sql database performance data-structures

I analyzed 3 books with the Stanford NLP library. I ran the analysis page by page, and for each book this is the output I get:

// An array of length P, where P is the total number of pages in the book,
// so that pageSentiment[0] represents the sentiment of page 1.
float[] pageSentiment

// An array of length P, where P is the total number of pages in the book,
// so that pageWords[0] represents the number of words on page 1.
int[] pageWords

// An array of length W, where W is the number of unique words in the book.
// Each entry of data is a triple {chapter, page, count}.
// For example, bookWords[0] has the following values:
//   word = "then"
//   data[0] = {1, 1, 2} => the word "then" occurs 2 times on page 1 (chapter 1)
//   data[1] = {1, 2, 1} => the word "then" occurs 1 time on page 2 (chapter 1)
//   data[2] = {1, 3, 0} => the word "then" occurs 0 times on page 3 (chapter 1)
//   data[3] = {1, 4, 0} => the word "then" occurs 0 times on page 4 (chapter 1)
//   data[4] = {2, 5, 3} => the word "then" occurs 3 times on page 5 (chapter 2)
//   data[5] = ...
// data is an array of triples, so a jagged array fits better than a 3-dimensional int[,,].
struct WordData { string word; int[][] data; }
WordData[] bookWords

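Now... I have to store all these results in a SQL database so that I can access them to draw charts and statistics tables on a web page. What I'm trying to figure out is the right way to store all these values flexibly, so that I can easily send different queries to the database and get different outputs depending on my current needs. For example, I need to be able to: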

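  • draw a histogram of the word count (pageWords), where
    each column can be either a page or a chapter (in which case I need to aggregate the page values);
  • see the frequency of a word page by page or chapter by chapter;
  • print the global values for each book;
  • etc...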

Any suggestions for the SQL table structure?

1 Answer:

Answer 0: (score: 1)

Just 3 tables

book
---
book_id
title
...

word
---
word_id
text
...

and a many-to-many table that holds the results

word_2_book
---
word_id
book_id
page_no
chapter_no
word_count
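
For reference, the three tables could be created roughly like this. It's only a sketch: it assumes SQLite through Python's standard sqlite3 module (no database engine is named here), and the column types, the UNIQUE constraint on word.text and the composite primary key are my own guesses, not part of the layout above.

```python
import sqlite3

# Assumption: SQLite; column types and constraints are illustrative guesses.
SCHEMA = """
CREATE TABLE book (
    book_id INTEGER PRIMARY KEY,
    title   TEXT NOT NULL
);
CREATE TABLE word (
    word_id INTEGER PRIMARY KEY,
    text    TEXT NOT NULL UNIQUE
);
CREATE TABLE word_2_book (
    word_id    INTEGER NOT NULL REFERENCES word(word_id),
    book_id    INTEGER NOT NULL REFERENCES book(book_id),
    page_no    INTEGER NOT NULL,
    chapter_no INTEGER NOT NULL,
    word_count INTEGER NOT NULL,
    -- one row per (book, word, page); assumed key, not stated in the answer
    PRIMARY KEY (book_id, word_id, page_no)
);
"""

conn = sqlite3.connect(":memory:")  # in-memory database, just for the demo
conn.executescript(SCHEMA)

# List the tables that now exist
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
print(tables)  # ['book', 'word', 'word_2_book']
```

Each {chapter, page, count} triple from bookWords would then become one row of word_2_book.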

Then just

select * 
from word_2_book wb
where wb.book_id=? and wb.word_id=?

and you can apply any aggregate function.
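
As an example of the aggregation, here is a small sketch that loads the sample rows for the word "then" from the question and sums them per chapter, per page and for the whole book. It again assumes SQLite via Python's sqlite3 module; book_id = 1 and word_id = 1 are made-up ids used only for the demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE word_2_book (
    word_id INTEGER, book_id INTEGER, page_no INTEGER,
    chapter_no INTEGER, word_count INTEGER)""")

# Sample triples {chapter, page, count} for the word "then", from the question.
sample = [(1, 1, 2), (1, 2, 1), (1, 3, 0), (1, 4, 0), (2, 5, 3)]
conn.executemany(
    "INSERT INTO word_2_book (word_id, book_id, page_no, chapter_no, word_count) "
    "VALUES (1, 1, ?, ?, ?)",  # hypothetical ids: word_id = 1, book_id = 1
    [(page, chapter, count) for chapter, page, count in sample])

# Frequency of the word per chapter (page values aggregated with SUM)
per_chapter = conn.execute("""
    SELECT chapter_no, SUM(word_count)
    FROM word_2_book
    WHERE book_id = ? AND word_id = ?
    GROUP BY chapter_no
    ORDER BY chapter_no""", (1, 1)).fetchall()

# Frequency of the word per page
per_page = conn.execute("""
    SELECT page_no, word_count
    FROM word_2_book
    WHERE book_id = ? AND word_id = ?
    ORDER BY page_no""", (1, 1)).fetchall()

# Total occurrences in the whole book
total = conn.execute("""
    SELECT SUM(word_count)
    FROM word_2_book
    WHERE book_id = ? AND word_id = ?""", (1, 1)).fetchone()[0]

print(per_chapter)  # [(1, 3), (2, 3)]
print(total)        # 6
```

Dropping the word_id filter and grouping by page_no or chapter_no gives the total words per page or chapter for the histogram, as long as every word of the book is stored in word_2_book.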