Question

I'm writing a program to split the words contained in a hashtag.

For example, I want to split the hashtags:

#Whatthehello #goback

into:

What the hello go back

I'm having trouble using re.sub with a function argument.

The code I've written is:

import re,pdb

def func_replace(each_func):
    i=0
    wordsineach_func=[] 
    while len(each_func) >0:
        i=i+1
        word_found=longest_word(each_func)
        if len(word_found)>0:
            wordsineach_func.append(word_found)
            each_func=each_func.replace(word_found,"")
    return ' '.join(wordsineach_func)

def longest_word(phrase):
    phrase_length=len(phrase)
    words_found=[];index=0
    outerstring=""
    while index < phrase_length:
        outerstring=outerstring+phrase[index]
        index=index+1
        if outerstring in words or outerstring.lower() in words:
            words_found.append(outerstring)
    if len(words_found) ==0:
        words_found.append(phrase)
    return max(words_found, key=len)        

words=[]
# The file corncob_lowercase.txt contains a list of dictionary words
with open('corncob_lowercase.txt') as f:
    read_words=f.readlines()

for read_word in read_words:
    words.append(read_word.replace("\n","").replace("\r",""))

For example, when using the function like this:

s="#Whatthehello #goback"

#checking if the function is able to segment words
hashtags=re.findall(r"#(\w+)", s)
print func_replace(hashtags[0])

# using the function for re.sub
print re.sub(r"#(\w+)", lambda m: func_replace(m.group()), s)

The output I obtain is:

What the hello
#Whatthehello #goback

Which is not what I expected:

What the hello
What the hello go back

Why is this happening? In particular, I've used the suggestion from this answer, but I don't understand what goes wrong in this code.

Answer 1

Note that m.group() returns the entire string that was matched, whether or not it was part of a capturing group:

In [19]: m = re.search(r"#(\w+)", s)

In [20]: m.group()
Out[20]: '#Whatthehello'

m.group(0) also returns the entire match:

In [23]: m.group(0)
Out[23]: '#Whatthehello'

In contrast, m.groups() returns all capturing groups:

In [21]: m.groups()
Out[21]: ('Whatthehello',)

and m.group(1) returns the first capturing group:

In [22]: m.group(1)
Out[22]: 'Whatthehello'
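
As a compact summary of the accessors above, here is a minimal, self-contained sketch. It uses Python 3 syntax (an assumption; the code in this thread is Python 2), and the `.upper()` callback is only a placeholder for illustration, not part of the original program.

```python
import re

s = "#Whatthehello #goback"
m = re.search(r"#(\w+)", s)

whole = m.group()        # entire match, including the leading '#'
also_whole = m.group(0)  # identical to m.group()
first = m.group(1)       # only the text captured by (\w+)
all_groups = m.groups()  # tuple holding every capturing group

print(whole, also_whole, first, all_groups)
# #Whatthehello #Whatthehello Whatthehello ('Whatthehello',)

# Inside a re.sub callback, the accessor decides whether '#' is passed along.
# The .upper() call is a placeholder transformation.
upper_tags = re.sub(r"#(\w+)", lambda m: m.group(1).upper(), s)
print(upper_tags)
# WHATTHEHELLO GOBACK
```

The same match object is handed to the replacement function by re.sub, so whatever the function returns replaces the whole match, `#` included.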

So the problem in your code originates with the use of m.group() in

re.sub(r"#(\w+)", lambda m: func_replace(m.group()), s)

since

In [7]: re.search(r"#(\w+)", s).group()
Out[7]: '#Whatthehello'

whereas if you had used .group(1), you would have gotten

In [24]: re.search(r"#(\w+)", s).group(1)
Out[24]: 'Whatthehello'

and the preceding # makes all the difference:

In [25]: func_replace('#Whatthehello')
Out[25]: '#Whatthehello'

In [26]: func_replace('Whatthehello')
Out[26]: 'What the hello'

Thus, changing m.group() to m.group(1), and substituting /usr/share/dict/words for corncob_lowercase.txt,

import re

def func_replace(each_func):
    i = 0
    wordsineach_func = []
    while len(each_func) > 0:
        i = i + 1
        word_found = longest_word(each_func)
        if len(word_found) > 0:
            wordsineach_func.append(word_found)
            each_func = each_func.replace(word_found, "")
    return ' '.join(wordsineach_func)


def longest_word(phrase):
    phrase_length = len(phrase)
    words_found = []
    index = 0
    outerstring = ""
    while index < phrase_length:
        outerstring = outerstring + phrase[index]
        index = index + 1
        if outerstring in words or outerstring.lower() in words:
            words_found.append(outerstring)
    if len(words_found) == 0:
        words_found.append(phrase)
    return max(words_found, key=len)

words = []
# /usr/share/dict/words is used here in place of corncob_lowercase.txt
with open('/usr/share/dict/words', 'rb') as f:
    for read_word in f:
        words.append(read_word.strip())
s = "#Whatthehello #goback"
hashtags = re.findall(r"#(\w+)", s)
print func_replace(hashtags[0])
print re.sub(r"#(\w+)", lambda m: func_replace(m.group(1)), s)

prints

What the hello
What the hello gob a c k

since, alas, 'gob' is longer than 'go'.
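
The 'gob' result comes from always taking the longest dictionary prefix. One possible way around it, sketched below under stated assumptions, is to back off to a shorter prefix whenever the remainder cannot be split. This is not the asker's method: `WORDS` is a toy stand-in for the word file, `greedy_split`, `segment`, and `split_hashtags` are hypothetical names, and `greedy_split` simplifies the original by slicing off the prefix instead of calling str.replace. Python 3 syntax is assumed.

```python
import re

# Toy dictionary (an assumption; the original reads words from a file).
WORDS = {"what", "the", "hello", "go", "gob", "back"}

def greedy_split(tag):
    """Always take the longest dictionary prefix, like the original approach."""
    out = []
    while tag:
        prefixes = [tag[:i] for i in range(1, len(tag) + 1)
                    if tag[:i].lower() in WORDS]
        # With no dictionary prefix, fall back to the whole remainder.
        word = max(prefixes, key=len) if prefixes else tag
        out.append(word)
        tag = tag[len(word):]
    return " ".join(out)

def segment(tag):
    """Return a list of dictionary words covering all of tag, or None."""
    if not tag:
        return []
    # Prefer longer prefixes, but back off if the rest cannot be segmented.
    for i in range(len(tag), 0, -1):
        if tag[:i].lower() in WORDS:
            rest = segment(tag[i:])
            if rest is not None:
                return [tag[:i]] + rest
    return None

def split_hashtags(text):
    def repl(m):
        words = segment(m.group(1))  # group(1) excludes the leading '#'
        # Leave the hashtag untouched when no full segmentation exists.
        return " ".join(words) if words else m.group(0)
    return re.sub(r"#(\w+)", repl, text)

print(greedy_split("goback"))                   # gob ack
print(split_hashtags("#Whatthehello #goback"))  # What the hello go back
```

The recursive search can take exponential time on long inputs without memoization; for short hashtags that is unlikely to matter.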

One way you could have debugged this is by replacing the lambda function with a regular function, and then adding print statements:

def foo(m):
    result = func_replace(m.group())
    print(m.group(), result)
    return result

In [35]: re.sub(r"#(\w+)", foo, s)
('#Whatthehello', '#Whatthehello')   <-- This shows you what `m.group()` and `func_replace(m.group())` returns
('#goback', '#goback')
Out[35]: '#Whatthehello #goback'

That would focus your attention on

In [25]: func_replace('#Whatthehello')
Out[25]: '#Whatthehello'

which you could then compare with

In [26]: func_replace(hashtags[0])
Out[26]: 'What the hello'

In [27]: func_replace('Whatthehello')
Out[27]: 'What the hello'

That would lead you to ask: if m.group() returns '#Whatthehello', which method returns 'Whatthehello'? A look at the docs then solves the problem.
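
The debugging idea above can also be packaged as a small reusable wrapper. The following is a sketch in Python 3 syntax; `traced` and `shout` are hypothetical names introduced here, with `shout` standing in for func_replace so the example runs without a word file.

```python
import re

def traced(callback):
    """Wrap a re.sub callback so each call prints what it saw and returned."""
    def wrapper(m):
        result = callback(m)
        # Show the whole match, the captured groups, and the replacement.
        print(repr(m.group()), m.groups(), "->", repr(result))
        return result
    return wrapper

def shout(text):
    # Hypothetical stand-in for func_replace: upper-cases its argument.
    return text.upper()

s = "#Whatthehello #goback"
out = re.sub(r"#(\w+)", traced(lambda m: shout(m.group(1))), s)
print(out)
```

Printing m.group() next to m.groups() makes the stray `#` visible at a glance.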

Using a function as an argument to re.sub in Python?
