Split words without separating e.g. 'New York'

Time: 2018-09-28 13:32:49

Tags: python

I know how to split a string into a list of words, like this:

some_string = "Siva is belongs to new York and he was living in park meadows mall apartment "
some_string.split()
# ['Siva', 'is', 'belongs', 'to', 'new', 'York', 'and', 'he', 'was', 'living', 'in', 'park', 'meadows', 'mall', 'apartment']

However, some words should not be separated, for example "New York" and "Park Meadows Mall". I have saved these special cases in a list called some_list:

some_list = [('new York'), ('park meadows mall')]

The desired result would be:

['Siva', 'is', 'belongs', 'to', 'new York', 'and', 'he', 'was', 'living', 'in', 'park meadows mall', 'apartment']

Any ideas on how to get this done?

1 answer:

Answer 0: (score: 0)

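You can reassemble the split elements into their compound form. Ideally, you want to scan the split string only once and check each element against all possible replacements.

A naive approach is to also convert some_list into a lookup table that maps a word to all possible sequences starting with it. For example, the element 'new' indicates a potential replacement of 'new', 'York'. Such a table can be built by splitting off the first word of each compound word: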

replacements = {}
for compound in some_list:
    words = compound.split()  # 'new York' => ['new', 'York']
    try:                      # key exists: 'new' => [['new', 'York'], ['new', 'Orleans']]
        replacements[words[0]].append(words)
    except KeyError:          # first occurrence: 'new' => [['new', 'York']]
        replacements[words[0]] = [words]
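
As a sketch of the same lookup table, assuming the compounds are plain strings as in the question, `collections.defaultdict` removes the need for the explicit `try`/`except`. The helper name `build_replacements` and the extra entry 'new Orleans' are illustrative assumptions, not part of the original post.

```python
from collections import defaultdict

def build_replacements(compounds):
    """Map the first word of each compound to every word sequence starting with it."""
    table = defaultdict(list)
    for compound in compounds:
        words = compound.split()
        if words:  # ignore empty entries
            table[words[0]].append(words)
    return dict(table)

# 'new Orleans' is added here only to show a first-word collision
example_list = ['new York', 'park meadows mall', 'new Orleans']
example_table = build_replacements(example_list)
print(example_table['new'])   # [['new', 'York'], ['new', 'Orleans']]
```

Returning a plain `dict` keeps later lookups of unknown words from silently creating empty entries.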

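With this table, you can traverse the split string and test for each word whether it may start a compound word. The tricky part is to avoid also adding the trailing parts of a compound word.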

splitted_string = some_string.split()
compound_string = []
insertion_offset = 0
for index, word in enumerate(splitted_string):
    # we already added a compound string, skip its members
    if len(compound_string) + insertion_offset > index:
        continue
    # check if a compound word starts here...
    try:
        candidate_compounds = replacements[word]
    except KeyError:
        # definitely not, just keep the word
        compound_string.append(word)
    else:
        # try all possible compound words...
        for compound in candidate_compounds:
            if splitted_string[index:index+len(compound)] == compound:
                # one output element now covers len(compound) input words
                insertion_offset += len(compound) - 1
                compound_string.append(' '.join(compound))
                break
        # ...but otherwise, just keep the word
        else:
            compound_string.append(word)

This stitches all compound words back together:

 >>> print(compound_string)
 ['Siva', 'is', 'belongs', 'to', 'new York', 'and', 'he', 'was', 'living', 'in', 'park meadows mall', 'apartment']
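
Offset bookkeeping of this kind is easy to get wrong, so here is a minimal alternative sketch that advances an explicit position instead. The function name `join_compounds` is an assumption for illustration. It also tries longer candidates first, a choice the question does not specify, and matching is case-sensitive.

```python
def join_compounds(text, compounds):
    """Split text on whitespace, keeping the listed compounds together."""
    # first word -> candidate word sequences, longest first
    table = {}
    for compound in compounds:
        words = compound.split()
        if words:
            table.setdefault(words[0], []).append(words)
    for candidates in table.values():
        candidates.sort(key=len, reverse=True)

    tokens = text.split()
    result = []
    position = 0
    while position < len(tokens):
        for candidate in table.get(tokens[position], ()):
            if tokens[position:position + len(candidate)] == candidate:
                result.append(' '.join(candidate))
                position += len(candidate)  # jump past the whole compound
                break
        else:
            # no compound starts here, keep the single word
            result.append(tokens[position])
            position += 1
    return result

example_string = "Siva is belongs to new York and he was living in park meadows mall apartment "
print(join_compounds(example_string, ['new York', 'park meadows mall']))
# ['Siva', 'is', 'belongs', 'to', 'new York', 'and', 'he', 'was', 'living', 'in', 'park meadows mall', 'apartment']
```

Because the position jumps by the length of the matched compound, no trailing member is ever visited, so no separate skip check is needed.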

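Note that the ideal structure of the replacements table depends on the words in some_list. If no two compounds share a first word, you can skip the list of compound words and store just one compound per key. If there are many collisions, you may have to nest several tables to avoid trying every candidate. The latter is especially important if some_string is large.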
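
One way to read "nested tables" is a word-level trie, where each dictionary level holds the possible next words. The following is only a sketch under that assumption: the names `build_trie`, `join_with_trie`, and the `END` sentinel are hypothetical, and the longest-match rule is a design choice rather than something the post prescribes.

```python
END = None  # sentinel key marking "a compound ends at this node" (assumption of this sketch)

def build_trie(compounds):
    """Nest one dict per word so that shared prefixes are stored only once."""
    root = {}
    for compound in compounds:
        node = root
        for word in compound.split():
            node = node.setdefault(word, {})
        node[END] = True
    return root

def join_with_trie(text, trie):
    """Split text on whitespace, merging the longest compound found at each position."""
    tokens = text.split()
    result = []
    position = 0
    while position < len(tokens):
        node = trie
        longest = 0  # length of the longest compound starting at position
        for end in range(position, len(tokens)):
            node = node.get(tokens[end])
            if node is None:
                break
            if END in node:
                longest = end - position + 1
        if longest:
            result.append(' '.join(tokens[position:position + longest]))
            position += longest
        else:
            result.append(tokens[position])
            position += 1
    return result

example_trie = build_trie(['new York', 'new York City', 'park meadows mall'])
print(join_with_trie("she left new York City for new York", example_trie))
# ['she', 'left', 'new York City', 'for', 'new York']
```

Each input word is followed down the trie at most as far as the longest stored compound, so the cost per position no longer grows with the number of compounds that share a first word.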