Converting a two-column text document to single lines for text mining

Time: 2017-06-01 03:21:14

Tags: r perl

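I used pdftools to extract text from a PDF and saved the result as a txt file.

Is there an efficient way to convert a txt file with two columns into a file with a single column?

Here is an example of what I have: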

Alice was beginning to get very      into the book her sister was reading,
tired of sitting by her sister       but it had no pictures or conversations
on the bank, and of having nothing   in it, `and what is the use of a book,' 
to do: once or twice she had peeped  thought Alice `without pictures or conversation?'

instead of what I want:

Alice was beginning to get very tired of sitting by her sister on the bank, and
of having nothing to do: once or twice she had peeped into the book her sister was 
reading, but it had no pictures or conversations in it, `and what is the use of a 
book,' thought Alice `without pictures or conversation?'

Based on "Extract Text from Two-Column PDF with R", I modified the function to get:

library(readr)    
trim = function (x) gsub("(?<=[\\s])\\s*|^\\s+|\\s+$", "", x,  perl=TRUE)

QTD_COLUMNS = 2

read_text = function(text) {
  result = ''
  # Get the positions of every " " in each line of the page.
  lstops = gregexpr(pattern =" ",text)
  # Put the two most frequent " " positions in a vector.
  stops = as.integer(names(sort(table(unlist(lstops)),decreasing=TRUE)[1:2]))
  # Slice based on the specified number of columns (this can be improved).
  for(i in seq(1, QTD_COLUMNS, by=1))
  {
    temp_result = sapply(text, function(x){
      start = 1
      stop =stops[i] 
      if(i > 1)            
        start = stops[i-1] + 1
      if(i == QTD_COLUMNS)#last column, read until end.
        stop = nchar(x)+1
      substr(x, start=start, stop=stop)
    }, USE.NAMES=FALSE)
    temp_result = trim(temp_result)
    result = append(result, temp_result)
  }
  result
}

txt = read_lines("alice_in_wonderland.txt")

result = ''

for (i in 1:length(txt)) { 
  page = txt[i]
  t1 = unlist(strsplit(page, "\n"))      
  maxSize = max(nchar(t1))
  t1 = paste0(t1,strrep(" ", maxSize-nchar(t1)))
  result = append(result,read_text(t1))
}

result

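But I had no luck with some files. I would like to know whether there is a more general/better regular expression to achieve this result.

Many thanks in advance!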

2 answers:

Answer 0: (score: 0)

If the two columns always have constant widths, this looks like a fixed-width file:

dat <- read.fwf(textConnection(txt), widths=c(37,48), stringsAsFactors=FALSE)
gsub("\\s+", " ", paste(unlist(dat), collapse=" "))

This puts everything into one long string:

[1] "Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, `and what is the use of a book,' thought Alice `without pictures or conversation?"
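When the width of the left column is not known in advance, one possible way to estimate it is to look for the character positions that are blank on every line and treat the longest such run as the gutter between the columns. The sketch below is only an illustration under that assumption: find_split is a hypothetical helper (not part of any package), the sample lines are rebuilt from the question's example, and pages whose columns are not separated by a clean all-blank gutter will not work.

```r
# Hypothetical helper: estimate the last character position of the left
# column by finding the longest run of positions that are blank on every line.
find_split <- function(lines) {
  width  <- max(nchar(lines))
  # Pad all lines to the same width so character positions line up.
  padded <- paste0(lines, strrep(" ", width - nchar(lines)))
  chars  <- do.call(rbind, strsplit(padded, ""))
  blank  <- colSums(chars != " ") == 0   # TRUE where every line has a space
  runs   <- rle(blank)
  ends   <- cumsum(runs$lengths)         # end position of each run
  cand   <- which(runs$values)           # indices of the all-blank runs
  if (length(cand) == 0) stop("no all-blank column found")
  ends[cand[which.max(runs$lengths[cand])]]
}

# Sample data rebuilt from the question (left column padded to 37 characters).
left_in  <- c("Alice was beginning to get very",
              "tired of sitting by her sister",
              "on the bank, and of having nothing",
              "to do: once or twice she had peeped")
right_in <- c("into the book her sister was reading,",
              "but it had no pictures or conversations",
              "in it, `and what is the use of a book,'",
              "thought Alice `without pictures or conversation?'")
txt <- sprintf("%-37s%s", left_in, right_in)

w     <- find_split(txt)
left  <- trimws(substr(txt, 1, w))
right <- trimws(substr(txt, w + 1, nchar(txt)))
text  <- paste(c(left, right), collapse = " ")
```

The estimated position w could also be passed to read.fwf as the first width instead of a hard-coded 37.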

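Answer 1: (score: 0)

With a fixed-width left column, we can split each line into its first 37 characters and the rest, and append these to the strings for the left and right columns. For example, using a regular expression: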

use warnings;
use strict;

my $file = 'two_column.txt';
open my $fh, '<', $file or die "Can't open $file: $!";

my ($left_col, $right_col);

while (<$fh>) 
{
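    my ($left, $right) = /(.{37})(.*)/;

    # Collapse trailing whitespace to one space so that words from
    # consecutive lines do not run together when concatenated.
    $left  =~ s/\s*$/ /;
    $right =~ s/\s*$/ /;

    $left_col  .= $left;
    $right_col .= $right;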
}
close $fh;

print $left_col, $right_col, "\n";

This prints the whole text. Alternatively, join the columns with my $text = $left_col . $right_col;

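The regex pattern (.{37}) matches any character (.) exactly 37 times ({37}) and captures the match with (). (.*) captures everything that remains on the line. Both captures are returned by the match and assigned. Trailing whitespace in $left is collapsed to a single space. Each piece is then appended (.=) to the string for its column.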

Or from the command line:

perl -wne'
    ($l, $r) = /(.{37})(.*)/; $l =~ s/\s*$/ /; $r =~ s/\s*$/ /; $cL .= $l; $cR .= $r;
     }{ print $cL,$cR,"\n"
' two_column.txt

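where }{ closes the implicit loop that -n wraps around the code and opens a new block, which therefore acts like an END block: it runs once before exit, after all lines have been processed.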
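For readers working in R, as in the question, the same 37-character capture can be sketched with base R's regexec and regmatches. This is only an illustration of the regex split described in this answer, applied to a single line rebuilt from the question's sample; the variable names are arbitrary.

```r
# One line of the sample, with the left column padded to 37 characters.
line <- sprintf("%-37s%s",
                "Alice was beginning to get very",
                "into the book her sister was reading,")

# regexec returns the full match followed by each capture group.
m <- regmatches(line, regexec("^(.{37})(.*)$", line))[[1]]

left  <- sub("\\s*$", " ", m[2])  # collapse trailing whitespace to one space
right <- m[3]                     # everything after the first 37 characters
```

Here m[1] holds the whole match, so the two columns are m[2] and m[3].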