Question

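I have a string of characters, and I'm looking for the arrangement of those characters that is the most pronounceable.

For example, if I have the letters "ascrlyo", some arrangements are more pronounceable than others. The following might get a "high score":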

scaroly crasoly

The following would probably score low:

oascrly yrlcsoa

Is there a simple algorithm I can use? Or better yet, a Python function that achieves this?

Thanks!

Answer 1

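Start by solving a simpler problem: is a given word pronounceable?

Machine learning, specifically 'supervised learning', could be effective here. Train a binary classifier on a training set of dictionary words and scrambled words (assume the scrambled words are all unpronounceable). For features, I suggest counting bigrams and trigrams. My reasoning: unpronounceable trigrams such as 'tns' and 'srh' are rare in dictionary words, even though each individual letter is common.

The idea is that the trained algorithm will learn to classify words containing any rare trigrams as unpronounceable, and words containing only common trigrams as pronounceable.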
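To make the feature idea concrete, here is a minimal pure-Python sketch of counting character trigrams and finding the rarest one in a word. The helper names (`char_ngrams`, `ngram_counts`, `rarest_trigram_count`) and the five-word vocabulary are assumptions for illustration only; a real run would count over a full dictionary.

```python
from collections import Counter

def char_ngrams(word, n):
    """Return all contiguous substrings of length n in word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def ngram_counts(words, n):
    """Count how often each character n-gram occurs across words."""
    counts = Counter()
    for w in words:
        counts.update(char_ngrams(w, n))
    return counts

# Tiny illustrative vocabulary (an assumption, not real dictionary data).
vocab = ["string", "strong", "stream", "spring", "strain"]
trigrams = ngram_counts(vocab, 3)

def rarest_trigram_count(word, counts):
    """Frequency of the least common trigram in word (0 means never seen)."""
    return min((counts[t] for t in char_ngrams(word, 3)), default=0)

print(char_ngrams("scaroly", 3))
print(rarest_trigram_count("tns", trigrams))
```

A word whose rarest trigram was never seen in the dictionary is a candidate for "unpronounceable"; the classifier learns a softer version of this rule from the counts.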

Here is an implementation using scikit-learn (http://scikit-learn.org/):

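import random

def scramble(s):
    # Shuffle the letters of s into a random order.
    return "".join(random.sample(s, len(s)))

# Keep only all-lowercase dictionary entries (this skips proper nouns).
with open('/usr/share/dict/words') as f:
    words = [w.strip() for w in f if w == w.lower()]
scrambled = [scramble(w) for w in words]

# Dictionary words are one class; scrambled copies are assumed unpronounceable.
X = words+scrambled
y = ['word']*len(words) + ['unpronounceable']*len(scrambled)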

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

text_clf = Pipeline([
    ('vect', CountVectorizer(analyzer='char', ngram_range=(1, 3))),
    ('clf', MultinomialNB())
    ])

text_clf = text_clf.fit(X_train, y_train)
predicted = text_clf.predict(X_test)

from sklearn import metrics
print(metrics.classification_report(y_test, predicted))

It scores 92% accuracy. Since pronounceability is subjective anyway, this may be about as good as it gets.

                 precision    recall  f1-score   support

      scrambled       0.93      0.91      0.92     52409
           word       0.92      0.93      0.93     52934

    avg / total       0.92      0.92      0.92    105343

It agrees with your examples:

>>> text_clf.predict("scaroly crasoly oascrly yrlcsoa".split())
['word', 'word', 'unpronounceable', 'unpronounceable']

For the curious, here are 10 scrambled words it classifies as pronounceable:

moro garapm ocenfir onerixoatteme arckinbo raetomoporyo bheral accrene cchmanie suroatipsheq

And finally, 10 dictionary words misclassified as unpronounceable:

ilch tohubohu usnea halfpaced pyrostilpnite lynnhaven cruel enure moldproof piecemeal
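Coming back to the original question, any pronounceability scorer can be used to rank the arrangements of a set of letters. The sketch below is one way that might look; `rank_arrangements` and `toy_score` are hypothetical names, and `toy_score` is only a stand-in scorer (it rewards vowel/consonant alternation), not the classifier trained above.

```python
from itertools import permutations

def rank_arrangements(letters, score, top=5):
    """Return the `top` distinct arrangements of `letters` with the highest score."""
    candidates = {"".join(p) for p in permutations(letters)}
    # Sort by descending score, breaking ties alphabetically for determinism.
    return sorted(candidates, key=lambda w: (-score(w), w))[:top]

def toy_score(word, vowels="aeiouy"):
    """Hypothetical stand-in scorer: the fraction of adjacent letter pairs
    that switch between vowel and consonant."""
    if len(word) < 2:
        return 0.0
    flips = sum((a in vowels) != (b in vowels) for a, b in zip(word, word[1:]))
    return flips / (len(word) - 1)

print(rank_arrangements("ascrlyo", toy_score, top=3))
```

With a fitted scikit-learn pipeline, the scorer could instead be built from `predict_proba`, taking the column for the 'word' class as located through `classes_`. Enumerating all permutations is only practical for short strings: 7 letters give 5040 candidates, but the count grows factorially.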

Answer 2

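(For completeness, here's my original pure Python solution, which inspired me to try machine learning.)

I agree that a reliable solution would require a sophisticated model of the English language, but maybe we can come up with a simple heuristic that's tolerably bad.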

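I can think of two basic rules satisfied by most pronounceable words. With c standing for a consonant sound and v for a vowel sound, they can be written as a regular expression:

c?c?(v+cc?)*v*

Now a simplistic attempt to identify sounds from spelling:

vowels = "a e i o u y".split()
consonants = "b bl br c ch cr chr cl ck d dr f fl g gl gr h j k l ll m n p ph pl pr q r s sc sch sh sl sp st t th thr tr v w wr x y z".split()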

v = "({0})".format("|".join(vowels))
c = "({0})".format("|".join(consonants))

import re
pattern = re.compile("^{1}?{1}?({0}+{1}{1}?)*{0}*$".format(v, c))
def test(w):
    return re.search(pattern, w)

def predict(words):
    return ["word" if test(w) else "scrambled" for w in words]

Testing these rules with the regular expression gives:

             precision    recall  f1-score   support

  scrambled       0.90      0.57      0.70     52403
       word       0.69      0.93      0.79     52940

avg / total       0.79      0.75      0.74    105343

It scores about 74% on the word/scrambled test set.

A tweaked version scored 80%.
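To see what the abstract pattern c?c?(v+cc?)*v* accepts, it can help to reduce a word to its consonant/vowel shape first. This is a simplified sketch, not the heuristic above: it assumes every letter is exactly one sound and treats y as a consonant, and the names `SHAPE`, `shape`, and `fits_rules` are made up for illustration.

```python
import re

# The abstract rule: c = one consonant sound, v = one vowel sound.
SHAPE = re.compile(r"c?c?(v+cc?)*v*")

def shape(word, vowels="aeiou"):
    """Map each letter to 'v' or 'c' (simplification: one letter = one sound)."""
    return "".join("v" if ch in vowels else "c" for ch in word)

def fits_rules(word):
    """True if the word's consonant/vowel shape matches the whole pattern."""
    return SHAPE.fullmatch(shape(word)) is not None

for w in "scaroly crasoly oascrly yrlcsoa".split():
    print(w, shape(w), fits_rules(w))
```

Even this cruder version separates the four examples from the question: "scaroly" and "crasoly" both reduce to ccvcvcc and match, while "oascrly" and "yrlcsoa" contain runs of five consonants and do not.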

Arrange letters in the most pronounceable way?
