Question

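I want to collect unstructured data from a fairly large table (about 350,000 observations). What strategy would you recommend?

Suppose I have the following database: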

|ID |                           Description                          | 
|12 | Mr A is thirty-five years old and works as an accountant in ...|
|34 | Mr B, 24 and has set up a retail business since 2004.          |
|55 | Mr C aged 58, lives in town A and has a hardware shop ...      |

...

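For each observation, I would like to extract the age, the town and the profession (if the data is available).

I started with SAS, using Perl-style regular expressions. Building the regexes and capturing the data took a lot of time, but it works fairly well. I know regular expressions may not be the best strategy, but I would like to capture most of the data automatically as the number of observations grows.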

Answer 1

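I see two problems right away. One: extracting structured data. Two: presenting it graphically. I will start with One.

I don't consider the following an exact solution, it won't win any algorithm prizes, and with 350,000 rows it may take a few evenings to run. But if you want to try this road, it may give you some hints. (As some have mentioned, though, this could be a very bumpy road, or even a dead end.)

Add a few columns to the table, iterate over the rows using (the class) DBI, and add a separate function that tries to guess each parameter.

See, for example, PerlMonks for some efficient database updates.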

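# meta code alert: a sketch, not tested against a real database
use DBI;

# $dsn, $user and $password are placeholders for your own connection details
my $dbh = DBI->connect($dsn, $user, $password, { RaiseError => 1 });
my $sth = $dbh->prepare("SELECT ID, THETEXT FROM ATABLE");
$sth->execute();
while (my $row = $sth->fetchrow_hashref) {
    # hash keys normally match the column names as the driver returns them
    my $age = guess_age($row->{THETEXT});
    if ($age > 0) {
        ...    # update database (placeholder; dies as "Unimplemented" if run)
    }
}
# end meta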

sub guess_age {
    my $text = shift;
    my $age  = -1;    # flag: not found
    # look for a run of number words, hyphens or whitespace
    # ('...' stands for the number words left out here; as written the
    # pattern can also match a lone space or hyphen, so tighten it before use)
    if ($text =~ /((?:one|two|three|...ninety|-|\s)+)/) {
        $age = some_number_from_text_function($1);
    # see if we have some prefix words in front of a number
    } elsif ($text =~ /(?:age|aged)\s*(\d+)/) {
        $age = $1;
    # see if we have some postfix words after a number
    } elsif ($text =~ /(\d+)\s*(?:old|of age|years)/) {
        $age = $1;
    # see if we have a comma followed by a number early in the sentence
    # (the match must start before position 50 in the text)
    } elsif ($text =~ /,\s*(\d+)/ && $-[0] < 50) {
        $age = $1;
    }
    # further heuristics can be added as extra elsif branches
    return $age;
}
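The helper some_number_from_text_function used in guess_age is left undefined above. One possible stand-in, as a minimal sketch: the names number_from_words and spelled_number_in are made up for illustration, and the sketch assumes English number words from zero to ninety-nine only. It looks for a number word on a word boundary, so a stray space or hyphen on its own never counts as a match.

```perl
use strict;
use warnings;

# Hypothetical lookup tables: English number words for 0..19 and the tens.
my %UNITS = (
    zero => 0, one => 1, two => 2, three => 3, four => 4, five => 5,
    six => 6, seven => 7, eight => 8, nine => 9, ten => 10,
    eleven => 11, twelve => 12, thirteen => 13, fourteen => 14,
    fifteen => 15, sixteen => 16, seventeen => 17, eighteen => 18,
    nineteen => 19,
);
my %TENS = (
    twenty => 20, thirty => 30, forty => 40, fifty => 50,
    sixty => 60, seventy => 70, eighty => 80, ninety => 90,
);
my %VALUE = (%UNITS, %TENS);

# Build the pattern from the tables: a tens word optionally followed by a
# word for 1..9 ("thirty-five", "thirty five"), or a single word for 0..19.
my $tens_re  = join '|', keys %TENS;
my $ones_re  = join '|', grep { $UNITS{$_} >= 1 && $UNITS{$_} <= 9 } keys %UNITS;
my $units_re = join '|', sort { length($b) <=> length($a) } keys %UNITS;
my $NUMBER_RE = qr/\b((?:$tens_re)(?:[\s-]+(?:$ones_re))?|$units_re)\b/i;

# Convert a phrase such as "thirty-five" to 35; returns -1 on unknown words.
sub number_from_words {
    my ($phrase) = @_;
    return -1 unless defined $phrase && length $phrase;
    my $total = 0;
    for my $word (split /[\s-]+/, lc $phrase) {
        return -1 unless exists $VALUE{$word};
        $total += $VALUE{$word};
    }
    return $total;
}

# Find the first spelled-out number in a text; returns -1 if there is none.
sub spelled_number_in {
    my ($text) = @_;
    return -1 unless $text =~ $NUMBER_RE;
    return number_from_words($1);
}
```

This is still only a heuristic: it returns the first number word it sees, so a sentence like "has one shop and is fifty" would yield 1, not 50. Checking the surrounding words (for example a following "years old") would be needed to tell ages apart from other counts.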

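But again, this may be a dead end...

For the town, I guess any unexpected capital letter could be worth looking for: /[a-z]\W([A-Z]\w+)/ # i.e. a non-capital letter followed by a non-letter, followed by a capital plus any letters. For the profession, I really have no clue. Maybe word matching against a big hash of many professions?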
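The town and profession guesses above could be sketched roughly as follows. The function names guess_town and guess_profession and the tiny profession list are assumptions for illustration only; a real list would be much larger and would need to cope with multi-word descriptions such as "hardware shop".

```perl
use strict;
use warnings;

# Hypothetical, deliberately tiny profession list for illustration.
my %PROFESSION = map { $_ => 1 } qw(accountant retailer teacher plumber nurse);

# Return the first capitalised word that follows a lowercase letter and one
# non-word character, i.e. a capital that does not start a sentence.
# Returns undef if nothing matches. Personal names match this pattern too.
sub guess_town {
    my ($text) = @_;
    return $1 if $text =~ /[a-z]\W([A-Z]\w+)/;
    return;
}

# Return the first word of the text found in the profession hash
# (compared in lowercase), or undef if none is found.
sub guess_profession {
    my ($text) = @_;
    for my $word ($text =~ /([A-Za-z]+)/g) {
        my $key = lc $word;
        return $key if $PROFESSION{$key};
    }
    return;
}
```

Note that the town pattern requires at least two characters in the capitalised word, so the single-letter placeholders in the sample rows ("town A") do not match, while something like "Mr Smith" would be picked up as a false positive. Filtering the candidates against a list of known town names is one way to reduce that noise.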
