How can I split multiple joined words?

Date: 2008-10-12 02:37:20

Tags: string nlp

I have an array of about 1,000 entries; examples below:

wickedweather
liquidweather
driveourtrucks
gocompact
slimprojector

I would like to be able to split these into their respective words, like:

wicked weather
liquid weather
drive our trucks
go compact
slim projector

I was hoping a regular expression could do it. But since there is no boundary to stop on, and no capitalization I could key off of, I'm thinking some kind of dictionary lookup might be necessary?

I suppose it could be done by hand, but why, when it can be done with code! =) But this has me stumped. Any ideas?

14 Answers:

Answer 0 (score: 70)

The Viterbi algorithm is much faster. It computes the same scores as the recursive search in Dmitry's answer above, but in O(n) time. (Dmitry's search takes exponential time; Viterbi does it with dynamic programming.)

import re
from collections import Counter

def viterbi_segment(text):
    # probs[i] is the probability of the best segmentation of text[:i];
    # lasts[i] is where the final word of that segmentation starts.
    probs, lasts = [1.0], [0]
    for i in range(1, len(text) + 1):
        prob_k, k = max((probs[j] * word_prob(text[j:i]), j)
                        for j in range(max(0, i - max_word_length), i))
        probs.append(prob_k)
        lasts.append(k)
    # Trace back through lasts to recover the best word sequence.
    words = []
    i = len(text)
    while 0 < i:
        words.append(text[lasts[i]:i])
        i = lasts[i]
    words.reverse()
    return words, probs[-1]

def word_prob(word): return dictionary[word] / total
def words(text): return re.findall('[a-z]+', text.lower()) 
dictionary = Counter(words(open('big.txt').read()))
max_word_length = max(map(len, dictionary))
total = float(sum(dictionary.values()))

Test it:

>>> viterbi_segment('wickedweather')
(['wicked', 'weather'], 5.1518198982768158e-10)
>>> ' '.join(viterbi_segment('itseasyformetosplitlongruntogetherblocks')[0])
'its easy for me to split long run together blocks'

For practical use you will probably want a couple of refinements:

  • Add logarithms of probabilities instead of multiplying probabilities. This avoids floating-point underflow.
  • Your inputs will often use words that are not in your corpus. These substrings must be assigned a nonzero probability as words, or you end up with no solution or a bad solution. (That is just as true for the exponential search algorithm above.) This probability has to be siphoned off from the corpus words' probabilities and distributed plausibly among all the other candidate words; the general topic is known as smoothing in statistical language models. (You can get away with some pretty rough hacks, though.) This is where the O(n) Viterbi algorithm blows away the search algorithm, because considering non-corpus words blows up the branching factor. (A sketch combining both refinements follows this list.)
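
As a rough sketch of both refinements together (my illustration, not part of the original answer), here is a log-space variant with a crude length-penalized probability for unknown words; dictionary, total, and max_word_length are the globals defined above, and the 10/(total * 10**len(word)) formula is an arbitrary smoothing choice:

import math

def smoothed_log_prob(word):
    # Known words: log of relative frequency. Unknown words: a penalty
    # that shrinks fast with length, so real words are preferred but no
    # candidate ever gets probability zero.
    count = dictionary.get(word, 0)
    if count > 0:
        return math.log(count / total)
    return math.log(10.0 / (total * 10 ** len(word)))  # crude smoothing

def viterbi_segment_log(text):
    # Same dynamic program as above, but summing log-probabilities
    # instead of multiplying probabilities.
    best, lasts = [0.0], [0]
    for i in range(1, len(text) + 1):
        score_k, k = max((best[j] + smoothed_log_prob(text[j:i]), j)
                         for j in range(max(0, i - max_word_length), i))
        best.append(score_k)
        lasts.append(k)
    words = []
    i = len(text)
    while 0 < i:
        words.append(text[lasts[i]:i])
        i = lasts[i]
    words.reverse()
    return words, best[-1]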

Answer 1 (score: 33)

Can a human do it?

farsidebag
far sidebag
farside bag
far side bag

Not only do you have to use a dictionary, you may have to use a statistical approach to figure out what's most likely (or, God forbid, an actual HMM for the human language of your choice...)

For statistics that might be helpful, I turn you over to Dr. Peter Norvig, who addressed a different but related problem of spell-checking in 21 lines of code: http://norvig.com/spell-correct.html

(He does cheat a bit by folding every for loop into a single line... but still.)

Update: This got stuck in my head, so I had to birth it today. This code does a split similar to the one described by Robert Gamble, but it then orders the results based on word frequency in the provided dictionary file (which is now expected to be some text representative of your domain or of English in general. I used big.txt from Norvig, linked above, and catted a dictionary onto it to cover missing words).

A combination of two words will beat a combination of three words most of the time, unless the frequency difference is enormous; under the independence assumption each extra word multiplies in another small factor, so two words at 1e-4 apiece score 1e-8 while three score 1e-12.


I posted this code, with some minor modifications, on my blog (http://squarecog.wordpress.com/2008/10/19/splitting-words-joined-into-a-single-string/) and also wrote a bit about the underflow bug in this code. I was tempted to just quietly fix it, but figured this might help some folks who haven't seen the log trick before: http://squarecog.wordpress.com/2009/01/10/dealing-with-underflow-in-joint-probability-calculations/


Output on your words, plus a few of my own. Notice what happens with "orcore":

perl splitwords.pl big.txt words
answerveal: 2 possibilities
 -  answer veal
 -  answer ve al

wickedweather: 4 possibilities
 -  wicked weather
 -  wicked we at her
 -  wick ed weather
 -  wick ed we at her

liquidweather: 6 possibilities
 -  liquid weather
 -  liquid we at her
 -  li quid weather
 -  li quid we at her
 -  li qu id weather
 -  li qu id we at her

driveourtrucks: 1 possibilities
 -  drive our trucks

gocompact: 1 possibilities
 -  go compact

slimprojector: 2 possibilities
 -  slim projector
 -  slim project or

orcore: 3 possibilities
 -  or core
 -  or co re
 -  orc ore

The code:

#!/usr/bin/env perl

use strict;
use warnings;

sub find_matches($);
sub find_matches_rec($\@\@);
sub find_word_seq_score(@);
sub get_word_stats($);
sub print_results($@);
sub Usage();

our(%DICT,$TOTAL);
{
  my( $dict_file, $word_file ) = @ARGV;
  ($dict_file && $word_file) or die(Usage);

  {
    my $DICT;
    ($DICT, $TOTAL) = get_word_stats($dict_file);
    %DICT = %$DICT;
  }

  {
    open( my $WORDS, '<', $word_file ) or die "unable to open $word_file\n";

    foreach my $word (<$WORDS>) {
      chomp $word;
      my $arr = find_matches($word);


      local $_;
      # Schwartzian Transform
      my @sorted_arr =
        map  { $_->[0] }
        sort { $b->[1] <=> $a->[1] }
        map  {
          [ $_, find_word_seq_score(@$_) ]
        }
        @$arr;


      print_results( $word, @sorted_arr );
    }

    close $WORDS;
  }
}


sub find_matches($){
    my( $string ) = @_;

    my @found_parses;
    my @words;
    find_matches_rec( $string, @words, @found_parses );

    return  @found_parses if wantarray;
    return \@found_parses;
}

sub find_matches_rec($\@\@){
    my( $string, $words_sofar, $found_parses ) = @_;
    my $length = length $string;

    unless( $length ){
      push @$found_parses, $words_sofar;

      return @$found_parses if wantarray;
      return  $found_parses;
    }

    foreach my $i ( 2..$length ){
      my $prefix = substr($string, 0, $i);
      my $suffix = substr($string, $i, $length-$i);

      if( exists $DICT{$prefix} ){
        my @words = ( @$words_sofar, $prefix );
        find_matches_rec( $suffix, @words, @$found_parses );
      }
    }

    return @$found_parses if wantarray;
    return  $found_parses;
}


## Just a simple joint probability
## assumes independence between words, which is obviously untrue
## that's why this is broken out -- feel free to add better brains
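## NOTE: multiplying raw probabilities like this underflows to zero for
## long inputs; summing logs of the probabilities avoids that (this is
## the underflow bug discussed in the blog post linked above)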
sub find_word_seq_score(@){
    my( @words ) = @_;
    local $_;

    my $score = 1;
    foreach ( @words ){
        $score = $score * $DICT{$_} / $TOTAL;
    }

    return $score;
}

sub get_word_stats($){
    my ($filename) = @_;

    open(my $DICT, '<', $filename) or die "unable to open $filename\n";

    local $/= undef;
    local $_;
    my %dict;
    my $total = 0;

    while ( <$DICT> ){
      foreach ( split(/\b/, $_) ) {
        $dict{$_} += 1;
        $total++;
      }
    }

    close $DICT;

    return (\%dict, $total);
}

sub print_results($@){
    #( 'word', [qw'test one'], [qw'test two'], ... )
    my ($word,  @combos) = @_;
    local $_;
    my $possible = scalar @combos;

    print "$word: $possible possibilities\n";
    foreach (@combos) {
      print ' -  ', join(' ', @$_), "\n";
    }
    print "\n";
}

sub Usage(){
    return "$0 /path/to/dictionary /path/to/your_words";
}

Answer 2 (score: 8)

The best tool for the job here is recursion, not regular expressions. The basic idea is to start from the beginning of the string looking for a word, then take the rest of the string and look for another word, and so on until the end of the string is reached. A recursive solution is natural, since backtracking needs to occur when a given remainder of the string cannot be broken into a set of words. The solution below uses a dictionary to determine what is a word and prints out solutions as it finds them (some strings can be broken into multiple possible sets of words; for example, wickedweather could also be parsed as "wicked we at her"). If you just want one set of words, you will need to determine the rules for selecting the best set, perhaps by choosing the solution with the fewest words or by setting a minimum word length.

#!/usr/bin/perl

use strict;

my $WORD_FILE = '/usr/share/dict/words'; #Change as needed
my %words; # Hash of words in dictionary

# Open dictionary, load words into hash
open(WORDS, $WORD_FILE) or die "Failed to open dictionary: $!\n";
while (<WORDS>) {
  chomp;
  $words{lc($_)} = 1;
}
close(WORDS);

# Read one line at a time from stdin, break into words
while (<>) {
  chomp;
  my @words;
  find_words(lc($_));
}

sub find_words {
  # Print every way $string can be parsed into whole words
  my $string = shift;
  my @words = @_;
  my $length = length $string;

  foreach my $i ( 1 .. $length ) {
    my $word = substr $string, 0, $i;
    my $remainder = substr $string, $i, $length - $i;
    # Some dictionaries contain each letter as a word
    next if ($i == 1 && ($word ne "a" && $word ne "i"));

    if (defined($words{$word})) {
      push @words, $word;
      if ($remainder eq "") {
        print join(' ', @words), "\n";
        return;
      } else {
        find_words($remainder, @words);
      }
      pop @words;
    }
  }

  return;
}

Answer 3 (score: 3)

I think you're right in thinking that this is not really a job for a regular expression. I would approach it using the dictionary idea: look for the longest prefix of the string that is a word in the dictionary. When you find that, chop it off and do the same with the remainder of the string.

That method is subject to ambiguity; for example, "drivereallyfast" would first find "driver" and then have trouble with "eallyfast", so you would also need some backtracking if you ran into that situation. Or, since you don't have that many strings to split, just do by hand the ones that fail the automated split.
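
As a quick sketch of that idea (my illustration, not code from the answer), assuming dictionary is any set of lowercase words:

def greedy_split(text, dictionary):
    # Try the longest dictionary prefix first; backtrack to shorter
    # prefixes when the remainder cannot be segmented.
    if not text:
        return []
    for i in range(len(text), 0, -1):
        prefix = text[:i]
        if prefix in dictionary:
            rest = greedy_split(text[i:], dictionary)
            if rest is not None:
                return [prefix] + rest
    return None  # no segmentation found

On "drivereallyfast" this would try "driver" first, fail to segment "eallyfast", then back off to "drive" + "really" + "fast".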

Answer 4 (score: 1)

Well, the problem itself is not solvable with just a regular expression. A solution (probably not the best one) would be to get a dictionary and do a regular-expression match for each word in the dictionary against each word in the list, adding the space whenever successful. Certainly this would not be terribly quick, but it would be easy to program and faster than doing it by hand.
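
A minimal sketch of that brute-force idea (my illustration, matching longest words first so that, say, "weather" wins over "we", and cleaning up doubled spaces at the end):

import re

def brute_space(entry, dictionary):
    # Match every dictionary word against the entry, longest first,
    # padding each successful match with spaces.
    for word in sorted(dictionary, key=len, reverse=True):
        entry = re.sub(re.escape(word), ' ' + word + ' ', entry)
    return ' '.join(entry.split())

print(brute_space('driveourtrucks', {'drive', 'our', 'trucks'}))
# drive our trucks

Without the longest-first ordering (and ideally word-boundary anchoring), short words split longer ones apart, which is part of why this is "probably not the best" solution.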

Answer 5 (score: 1)

A dictionary-based solution would be required. This might be simplified somewhat if you have a limited dictionary of the words that can occur; otherwise, words that form prefixes of other words are going to be a problem (for example, "drive" is a prefix of "driver", so a splitter cannot tell where one word ends without looking ahead).

Answer 6 (score: 1)

This is related to a problem known as identifier splitting or identifier name tokenization. In the OP's case, the inputs seem to be concatenations of ordinary words; in identifier splitting, the inputs are class names, function names or other identifiers from source code, and the problem is harder. I realize this is an old question and the OP has either solved their problem or moved on, but in case someone else comes across this question while looking for identifier splitters (like I was, not long ago), I would like to offer Spiral ("SPlitters for IdentifieRs: A Library"). It is written in Python but comes with a command-line utility that can read a file of identifiers (one per line) and split each one.

Splitting identifiers is deceptively difficult. Programmers commonly use abbreviations, acronyms and word fragments when naming things, and they don't always use consistent conventions. Even when identifiers do follow some convention such as camel case, ambiguities can arise.

Spiral implements numerous identifier splitting algorithms, including a novel algorithm called Ronin. It uses a variety of heuristic rules, English dictionaries, and tables of token frequencies obtained from mining source code repositories. Ronin can split identifiers that do not use camel case or other naming conventions, including cases such as splitting J2SEProjectTypeProfiler into [J2SE, Project, Type, Profiler], which requires the reader to recognize J2SE as a unit. Here are some more examples of what Ronin can split:

# spiral mStartCData nonnegativedecimaltype getUtf8Octets GPSmodule savefileas nbrOfbugs
mStartCData: ['m', 'Start', 'C', 'Data']
nonnegativedecimaltype: ['nonnegative', 'decimal', 'type']
getUtf8Octets: ['get', 'Utf8', 'Octets']
GPSmodule: ['GPS', 'module']
savefileas: ['save', 'file', 'as']
nbrOfbugs: ['nbr', 'Of', 'bugs']

Using the examples from the OP's question:

# spiral wickedweather liquidweather  driveourtrucks gocompact slimprojector
wickedweather: ['wicked', 'weather']
liquidweather: ['liquid', 'weather']
driveourtrucks: ['driveourtrucks']
gocompact: ['go', 'compact']
slimprojector: ['slim', 'projector']

As you can see, it is not perfect. It's worth noting that Ronin has a number of parameters and adjusting them makes it possible to split driveourtrucks too, but at the cost of worsening performance on program identifiers.

More information can be found in the GitHub repo for Spiral.

Answer 7 (score: 1)

pip install wordninja

>>> import wordninja
>>> wordninja.split('bettergood')
['better', 'good']

Answer 8 (score: 0)

I may get downvoted for this, but: have the secretary do it.

You'll spend more time on a dictionary solution than it would take to process them manually. Further, you won't have 100% confidence in a solution, so you'll still have to give it manual attention anyway.

Answer 9 (score: 0)

Santhosh Thottingal has released a Python package called mlmorph that can be used for morphological analysis of Malayalam.

https://pypi.org/project/mlmorph/

Example:

from mlmorph import Analyser
analyser = Analyser()
analyser.analyse("കേരളത്തിന്റെ")

which gives:

[('കേരളം<np><genitive>', 179)]

He also wrote a blog post on the topic: https://thottingal.in/blog/2017/11/26/towards-a-malayalam-morphology-analyser/

Answer 10 (score: 0)

A simple solution with Python: install the wordsegment package: pip install wordsegment

$ echo thisisatest | python -m wordsegment
this is a test
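
The package also has a Python API; a small usage sketch (recent versions require calling load() before segmenting):

from wordsegment import load, segment

load()  # load the word-frequency tables
print(segment('thisisatest'))  # ['this', 'is', 'a', 'test']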

Answer 11 (score: 0)

This will work if the input uses camelCase. JavaScript!!!

function spinalCase(str) {
  let trimmed = str.trim()
  // Split on runs of non-word characters, before each capital letter,
  // or at underscores; then join with hyphens and lowercase everything.
  let regEx = /\W+|(?=[A-Z])|_/g
  let result = trimmed.split(regEx).join("-").toLowerCase()

  return result;
}

spinalCase("AllThe-small Things"); // "all-the-small-things"

Answer 12 (score: 0)

One possible solution is recursion (which can be converted into dynamic programming):

import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.Stack;

static List<String> wordBreak(
    String input,
    Set<String> dictionary
) {

  List<List<String>> result = new ArrayList<>();
  List<String> r = new ArrayList<>();

  helper(input, dictionary, result, "", 0, new Stack<>());

  for (List<String> strings : result) {
    String s = String.join(" ", strings);
    r.add(s);
  }

  return r;
}

static void helper(
    final String input,
    final Set<String> dictionary,
    final List<List<String>> result,
    String state,
    int index,
    Stack<String> stack
) {

  if (index == input.length()) {

    // add the last word
    stack.push(state);

    for (String s : stack) {
      if (!dictionary.contains(s)) {
        return;
      }
    }

    result.add((List<String>) stack.clone());

    return;
  }

  if (dictionary.contains(state)) {
    // Branch 1: treat `state` as a complete word, push it, and start
    // a fresh word at the current character.
    stack.push(state);
    helper(input, dictionary, result, "" + input.charAt(index),
           index + 1, stack);

    // Each helper call leaves one extra element on the stack, and its
    // first character is always input.charAt(index). Pop it and `state`
    // back off, then take branch 2: extend `state` with that character.
    String pop = stack.pop();
    String s = stack.pop();

    helper(input, dictionary, result, s + pop.charAt(0),
           index + 1, stack);

  }
  else {
    helper(input, dictionary, result, state + input.charAt(index),
           index + 1, stack);
  }

  return;
}

Another possible solution would be to use a trie data structure.
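
For illustration (my sketch, not part of the original answer), a trie lets a splitter enumerate every dictionary word that starts the remaining string in one left-to-right walk, rather than probing each candidate prefix separately:

class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_word = False

def build_trie(words):
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True
    return root

def prefix_lengths(text, root):
    # Yield the length of every dictionary word that begins `text`.
    node = root
    for i, ch in enumerate(text):
        node = node.children.get(ch)
        if node is None:
            return
        if node.is_word:
            yield i + 1

Any of the recursive splitters above can call prefix_lengths to pick candidate first words; for example, list(prefix_lengths('gocompact', build_trie(['go', 'compact']))) yields [2].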

Answer 13 (score: 0)

pip install wordninja

import wordninja
n = wordninja.split('bettergood')
m = wordninja.split("coffeeshop")
print(n, m)

Output:

['better', 'good'] ['coffee', 'shop']

A simpler check against a fixed word list (this only tests which known words occur as substrings, rather than producing a true segmentation):

words = ['hello', 'coffee', 'shop', 'better', 'good']
mat = 'coffeeshop'
expected = []
for i in words:
    if i in mat:
        expected.append(i)
print(expected)

Output:

['coffee', 'shop']