Getting the least number of sub-words

Time: 2016-04-09 18:38:51

Tags: java nlp text-segmentation

Dávid Horváth's solution, adapted to return the fewest words:

import java.util.*;

public class SubWordsFinder
{
    private Set<String> words;

    public SubWordsFinder(Set<String> words)
    {
        this.words = words;
    }

    public List<String> findSubWords(String word) throws NoSolutionFoundException
    {
        List<String> bestSolution = new ArrayList<>();
        if (word.isEmpty())
        {
            return bestSolution;
        }
        long length = word.length();
        int[] pointer = new int[]{0, 0};
        LinkedList<int[]> pointerStack = new LinkedList<>();
        LinkedList<String> currentSolution = new LinkedList<>();
        while (true)
        {
            boolean backtrack = false;
            for (int end = pointer[1] + 1; end <= length; end++)
            {
                if (end == length)
                {
                    backtrack = true;
                }
                pointer[1] = end;
                String wordToFind = word.substring(pointer[0], end);
                if (words.contains(wordToFind))
                {
                    currentSolution.add(wordToFind);
                    if (backtrack)
                    {
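                        // Prefer fewer words; on a tie, prefer the split whose shortest sub-word is longer
                        if (bestSolution.isEmpty()
                                || currentSolution.size() < bestSolution.size()
                                || (currentSolution.size() == bestSolution.size() && getSmallestSubWordLength(currentSolution) > getSmallestSubWordLength(bestSolution)))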
                        {
                            bestSolution = new ArrayList<>(currentSolution);
                        }
                        currentSolution.removeLast();
                    } else if (!bestSolution.isEmpty() && currentSolution.size() == bestSolution.size())
                    {
                        currentSolution.removeLast();
                        backtrack = true;
                    } else
                    {
                        int[] nextPointer = new int[]{end, end};
                        pointerStack.add(pointer);
                        pointer = nextPointer;
                    }
                    break;
                }
            }
            if (backtrack)
            {
                if (pointerStack.isEmpty())
                {
                    break;
                } else
                {
                    currentSolution.removeLast();
                    pointer = pointerStack.removeLast();
                }
            }
        }
        if (bestSolution.isEmpty())
        {
            throw new NoSolutionFoundException();
        } else
        {
            return bestSolution;
        }
    }

    private int getSmallestSubWordLength(List<String> words)
    {
        int length = Integer.MAX_VALUE;

        for (String word : words)
        {
            if (word.length() < length)
            {
                length = word.length();
            }
        }

        return length;
    }

    // Static, so callers can catch it without an enclosing SubWordsFinder instance
    public static class NoSolutionFoundException extends Exception
    {
        private static final long serialVersionUID = 1L;
    }
}

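I have a String consisting of regular English words in lower case. Assume this String has already been decomposed into a List of all possible sub-words: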

public List<String> getSubWords(String text)
{
    List<String> words = new ArrayList<>();

    for (int startingIndex = 0; startingIndex < text.length() + 1; startingIndex++)
    {
        for (int endIndex = startingIndex + 1; endIndex < text.length() + 1; endIndex++)
        {
            String subString = text.substring(startingIndex, endIndex);

            if (contains(subString))
            {
                words.add(subString);
            }
        }
    }

    Comparator<String> lengthComparator = (firstItem, secondItem) ->
    {
        if (firstItem.length() > secondItem.length())
        {
            return -1;
        }

        if (secondItem.length() > firstItem.length())
        {
            return 1;
        }

        return 0;
    };

    // Sort the list in descending String length order
    Collections.sort(words, lengthComparator);

    return words;
}

How do I find the least number of sub-words that together form the original String?

For example:

String text = "updatescrollbar";
List<String> leastWords = getLeastSubWords(text);
System.out.println(leastWords);

Output:

[update, scroll, bar]

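I don't know how to iterate over all the possibilities, since they change depending on which words have been chosen. A start would be something like this: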

public List<String> getLeastSubWords(String text)
{
    String textBackup = text;
    List<String> subWords = getSubWords(text);
    System.out.println(subWords);
    List<List<String>> containing = new ArrayList<>();

    List<String> validWords = new ArrayList<>();

    for (String subWord : subWords)
    {
        if (text.startsWith(subWord))
        {
            validWords.add(subWord);
            text = text.substring(subWord.length());
        }
    }

    // Did we find a valid words distribution?
    if (text.length() == 0)
    {
        System.out.println(validWords.size());
    }

    return null;
}

Note:
This is related to this question.

1 answer:

Answer 0 (score: 0)

This is a problem with a lot of possible scenarios to consider.

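Your example (updatescrollbar) already contains up, date, ate, update, scroll and bar, and it is still fairly simple. But what if choosing one of the sub-words leaves you with a single unmatched character at the end of the string?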
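
To make that pitfall concrete, here is a minimal sketch of a naive "always take the longest matching word" strategy getting stuck. The class name GreedyPitfall, the method greedy and the three-word dictionary are hypothetical and only serve the illustration.

```java
import java.util.*;

public class GreedyPitfall
{
    // Repeatedly takes the longest dictionary word that prefixes the rest of the text.
    // Returns null when no dictionary word matches at the current position.
    static List<String> greedy(String text, Set<String> dictionary)
    {
        List<String> result = new ArrayList<>();
        int position = 0;
        while (position < text.length())
        {
            int end = text.length();
            // Shrink the candidate from the right until it is a dictionary word
            while (end > position && !dictionary.contains(text.substring(position, end)))
            {
                end--;
            }
            if (end == position)
            {
                return null; // stuck: nothing in the dictionary starts here
            }
            result.add(text.substring(position, end));
            position = end;
        }
        return result;
    }

    public static void main(String[] args)
    {
        // Hypothetical dictionary: "carpet" is the longest prefix, but it leaves a lone "s"
        Set<String> dictionary = new HashSet<>(Arrays.asList("car", "carpet", "pets"));
        System.out.println(greedy("carpets", dictionary)); // prints null
    }
}
```

The valid split here is car + pets, which the greedy choice never reaches because it commits to carpet first. Any correct approach has to be able to undo such a choice.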

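So to solve this, you have to iterate over the list of sub-words multiple times, keep track of the shortest valid split found so far that matches your text, and keep iterating until you have tried all variations.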
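
One way to cover every variation without enumerating each one separately is dynamic programming over prefix lengths. This is a swapped-in technique, not necessarily what the answer had in mind, and the sketch below assumes the dictionary is available as a Set<String>; the class name LeastSubWords and the two-argument getLeastSubWords signature are hypothetical.

```java
import java.util.*;

public class LeastSubWords
{
    // Returns the fewest dictionary words that concatenate to text,
    // or an empty list if the text cannot be split at all.
    public static List<String> getLeastSubWords(String text, Set<String> dictionary)
    {
        int n = text.length();
        // best[i] = fewest words needed to build the first i characters; -1 = unreachable
        int[] best = new int[n + 1];
        // previous[i] = start index of the last word in the best split of the first i characters
        int[] previous = new int[n + 1];
        Arrays.fill(best, -1);
        best[0] = 0;

        for (int end = 1; end <= n; end++)
        {
            for (int start = 0; start < end; start++)
            {
                if (best[start] < 0 || !dictionary.contains(text.substring(start, end)))
                {
                    continue;
                }
                int candidate = best[start] + 1;
                if (best[end] < 0 || candidate < best[end])
                {
                    best[end] = candidate;
                    previous[end] = start;
                }
            }
        }

        LinkedList<String> result = new LinkedList<>();
        if (best[n] < 0)
        {
            return result;
        }
        // Walk the back-pointers from the end of the text to recover the words
        for (int end = n; end > 0; end = previous[end])
        {
            result.addFirst(text.substring(previous[end], end));
        }
        return result;
    }

    public static void main(String[] args)
    {
        Set<String> dictionary = new HashSet<>(Arrays.asList("up", "date", "ate", "update", "scroll", "bar"));
        System.out.println(getLeastSubWords("updatescrollbar", dictionary)); // prints [update, scroll, bar]
    }
}
```

Each prefix is solved once, so the number of dictionary lookups grows quadratically with the text length instead of exponentially. When several splits use the same number of words, this sketch simply keeps the first one it finds.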

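You can reduce the number of variations, for example by using an algorithm that takes the remaining space to be matched into account:

  • Sort the sub-words by length and try to fill the text with the longest words first: the possible sub-word length is the text length minus the length of the text already matched. This would use contains, so the text still to be matched can lie both before and after a word that has already been matched: use an array for the text, so that it is easier to keep track of what has been matched.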
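
The idea above can be sketched as a longest-first search that prunes a branch as soon as the remaining space shows it cannot beat the best split found so far. This is a simplification, not the answer's exact method: it matches words left to right with startsWith rather than anywhere in the text with contains, so no array of matched positions is needed. The class name LongestFirstSearch and its methods are hypothetical.

```java
import java.util.*;

public class LongestFirstSearch
{
    private final List<String> wordsByLength = new ArrayList<>();
    private List<String> best;

    public LongestFirstSearch(Collection<String> dictionary)
    {
        for (String word : dictionary)
        {
            if (!word.isEmpty()) // an empty word would never consume any text
            {
                wordsByLength.add(word);
            }
        }
        // Longest words first, so short splits tend to be found early
        wordsByLength.sort(Comparator.comparingInt(String::length).reversed());
    }

    // Returns a split with the fewest words, or an empty list if none exists.
    public List<String> find(String text)
    {
        best = null;
        if (!wordsByLength.isEmpty())
        {
            search(text, 0, new ArrayDeque<>());
        }
        return best == null ? Collections.emptyList() : best;
    }

    private void search(String text, int position, Deque<String> current)
    {
        int remaining = text.length() - position;
        if (remaining == 0)
        {
            if (best == null || current.size() < best.size())
            {
                best = new ArrayList<>(current);
            }
            return;
        }
        // Lower bound on the words still needed: ceil(remaining / longest word length)
        int longest = wordsByLength.get(0).length();
        int needed = (remaining + longest - 1) / longest;
        if (best != null && current.size() + needed >= best.size())
        {
            return; // this branch cannot produce a shorter split
        }
        for (String word : wordsByLength)
        {
            if (text.startsWith(word, position))
            {
                current.addLast(word);
                search(text, position + word.length(), current);
                current.removeLast();
            }
        }
    }

    public static void main(String[] args)
    {
        Set<String> dictionary = new HashSet<>(Arrays.asList("up", "date", "ate", "update", "scroll", "bar"));
        System.out.println(new LongestFirstSearch(dictionary).find("updatescrollbar")); // prints [update, scroll, bar]
    }
}
```

The pruning only helps once a first split has been found. For texts that cannot be split at all, the search may still explore many branches, so remembering positions that are known dead ends would likely be needed for long inputs.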