Question

I have a block of text that I'm trying to interpret in Java (or with grep/awk/etc.). It looks like this:

   Somewhat differently, plaques of the rN8 and rN9 mutants            and human coronavirus OC43 as well as the more divergent
   were of fully wild-type size, indicating that the suppressor mu-    SARS-CoV, human coronavirus HKU1, and bat coronaviruses
   tations, in isolation, were not noticeably deleterious to the       HKU4, HKU5, and HKU9 (Fig. 6B). Thus, not only do mem-
   --
   able effect on the viral phenotype. A potentially related obser-    sented for the existence of an interaction between nsp9
   vation is that the mutation A2U, which is also neutral by itself,   nsp8 (56). A hexadecameric complex of SARS-CoV nsp8 and
   is lethal in combination with the AACAAG insertion (data not        nsp7 has been found to bind to double-stranded RNA. The

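What I'd like to do is split it into two blocks: left and right. I'm having trouble coming up with a regex, or any other method, that splits a block of text that is visibly divided to the eye but not obviously divided to a programming language. The line lengths are variable.

I've considered finding the first block and then locating the second by looking for runs of multiple spaces, but I'm not sure that is a robust solution. Any ideas, snippets, pseudocode, links, etc.?

Source of the text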


The text was run through pdftotext as follows:

pdftotext -layout MyPdf.pdf

Answer 1

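I doubt there is any robust solution to this. I would go for some sort of heuristic approach.

Off the top of my head, I would compute a histogram of the column index of the first character of each word and split on the column with the highest score (the idea being to find a large number of words that are all aligned vertically at the same column). I might also weight each entry by the number of spaces that precede the word.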
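The histogram idea could be sketched roughly as follows. This is only an illustration, not a tested solution for real PDFs: the class and method names (`ColumnHistogram`, `guessSplitColumn`) are made up for the example, the input is assumed to contain spaces rather than tabs, and the first word of every line is skipped on the assumption that it belongs to the left margin (otherwise the margin column would always win).

```java
import java.util.List;

// Hypothetical helper illustrating the word-start histogram heuristic.
final class ColumnHistogram {

    /**
     * Returns the column where the right-hand block most plausibly starts,
     * or -1 if no candidate column was found.
     */
    static int guessSplitColumn(List<String> lines) {
        int width = 0;
        for (String line : lines) {
            width = Math.max(width, line.length());
        }

        // score[c] = sum, over all words starting at column c,
        // of the number of spaces directly preceding the word.
        int[] score = new int[width];
        for (String line : lines) {
            boolean seenWord = false; // assumption: ignore the first word of each line
            int spaces = 0;           // length of the current run of spaces
            for (int i = 0; i < line.length(); i++) {
                if (line.charAt(i) == ' ') {
                    spaces++;
                    continue;
                }
                boolean wordStart = (i == 0) || line.charAt(i - 1) == ' ';
                if (wordStart) {
                    if (seenWord) {
                        score[i] += spaces; // wide gaps count more than single spaces
                    }
                    seenWord = true;
                }
                spaces = 0;
            }
        }

        // Pick the column with the highest score.
        int best = -1;
        int bestScore = 0;
        for (int c = 0; c < width; c++) {
            if (score[c] > bestScore) {
                bestScore = score[c];
                best = c;
            }
        }
        return best;
    }
}
```

Weighting by the width of the preceding gap is what keeps ordinary single-spaced words from outvoting the gutter. On justified text with many lines that weighting may still not be enough, so treat the result as a guess to be checked.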

Answer 2

Blur the text and build an array of character densities, one entry per text column. Then look for gaps and split there.

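// Needs: import java.util.ArrayList; import java.util.List;
// Blur: fill single spaces between words so that only multi-space gaps remain.
String blurredText = text.replaceAll("(?<=\\S) (?=\\S)", ".");
// Split the blurred text (not the original) into lines.
String[] blurredLines = blurredText.split("\r\n?|\n");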

int maxRowLength = 0;
for (String blurredLine : blurredLines) {
  maxRowLength = Math.max(maxRowLength, blurredLine.length());
}

int[] columnCounts = new int[maxRowLength];
for (String blurredLine : blurredLines) {
  for (int i = 0, n = blurredLine.length(); i < n; ++i) {
    if (blurredLine.charAt(i) != ' ') { ++columnCounts[i]; } 
  }
}    

// Look for runs of zero of at least length 3.
// Alternatively, you might look for the n longest runs of zeros.
// Alternatively, you might look for runs of length min(columnCounts) to ignore
// horizontal rules.

int minBreakLen = 3;  // A tuning parameter.
List<Integer> breaks = new ArrayList<Integer>();
for (int i = 0; i <= maxRowLength - minBreakLen; ++i) {
  if (columnCounts[i] != 0) { continue; }
  int runLength = 1;
  while (i + runLength < maxRowLength && 0 == columnCounts[i + runLength]) {
    ++runLength;
  }
  if (runLength >= minBreakLen) {
    breaks.add(i);
  }
  i += runLength - 1;
}

System.out.println(breaks);
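Once a gutter has been located, each line can be cut at that column. The following is a minimal sketch under a few assumptions: `ColumnSplitter` and `splitAt` are hypothetical names, a single split column is used for the whole block, and `splitCol` is picked by the caller from the detected breaks. With indented input like the sample in the question, the left margin may itself show up as a break at column 0, so the break to use is the one that falls inside the text.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: cuts every line at one column and collects both halves.
final class ColumnSplitter {

    /** Returns a two-element array: { left block, right block }, lines joined by '\n'. */
    static String[] splitAt(List<String> lines, int splitCol) {
        List<String> left = new ArrayList<>();
        List<String> right = new ArrayList<>();
        for (String line : lines) {
            if (line.length() <= splitCol) {
                // Short line: everything belongs to the left block.
                left.add(line.trim());
                right.add("");
            } else {
                left.add(line.substring(0, splitCol).trim());
                right.add(line.substring(splitCol).trim());
            }
        }
        return new String[] {
            String.join("\n", left),
            String.join("\n", right)
        };
    }
}
```

Empty strings are kept for short lines so that the two blocks stay aligned line by line, which makes it easier to check the result against the original layout.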

Answer 3

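I work in this general area. I'm surprised that a recent two-column bioscience text (SARS, etc.) would be delivered as raw text laid out in two monospaced columns; it would normally be typeset in a proportional font or as HTML. So I suspect your text came from some other format (such as PDF). If so, you should try to obtain that original format. PDF is awful to parse, but PDF flattened to monospaced text is probably worse.

If you can find anyone who has worked in this area, look at what they did. If you have multiple documents (e.g. from different journals or reports), your problem is worse. Yes, I could write an algorithm that solves the example you posted, but my guess is that it would break on the next set of documents. You will end up customizing this for each different source (I and others have had to).

UPDATE: Thanks. Since it is PDF, I would start by asking around. We collaborate with the group at Penn State (who also did Citeseer). I also have colleagues at Cambridge who have spent months on PDF readers.

If you want to do it yourself, and it will take time, then I would start with PDFBox. I have done a lot of this and I think it is better than pdf2text or pdftotext. I can't remember whether it has a two-column option; I think it does.

UPDATE: Here is a recent answer covering several ways of handling two-column PDFs: http://metaoptimize.com/qa/questions/3943/methods-for-extracting-two-column-text-from-a-pdf I would certainly look at what others have done.

FWIW, I spend a lot of time trying to persuade people that scientists should not publish their output as PDF because it destroys machine parsing, as you and I have found.

UPDATE: You get the PDFs from your PI (== principal investigator?)? In that case you will be getting many different sources, which makes things worse.

What is the real problem you are trying to solve? I may be able to help.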

Splitting visual blocks of text in Java
