Python re.findall finds strange, incorrect patterns

Time: 2019-06-06 15:32:50

Tags: python regex

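I am generally puzzled why re.findall returns such weird things as empty strings and tuples (what is that supposed to mean?). It also does not seem to treat parentheses () normally, and it interprets | in a way I find wrong: ab|cd is read as (ab)|(cd), not as a(b|c)d like you would normally think. Because of that I cannot define the regex I need.
But in this example I see clearly wrong behavior on a simple pattern: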

([a-zA-Z0-9]+\.+)+[a-zA-Z0-9]{1,3}

It describes simple web addresses (such as gskinner.com or www.capitolconnection.org), as you can see in the regex help at https://regexr.com/. With re.findall I get:

hotmail.
living.
item.
2.
4S.

That is, letters followed by a dot and nothing else. How is this possible?

Full code, in which I try to filter junk out of a text:

import re

singles = r'[()\.\/$%=0-9,?!=; \t\n\r\f\v\":\[\]><]'


digits_str = singles + r'[()\-\.\/$%=0-9 \t\n\r\f\v\'\":\[\]]*'



#small_word = '[a-zA-Z0-9]{1,3}'

#junk_then_small_word = singles + small_word + '(' + singles + small_word + ')*'


email = singles + r'\S+@\S*'






http_str = r'[^\.]+\.+[^\.]+\.+([^\.]+\.+)+?'

http = '(http|https|www)' + http_str

web_address = r'([a-zA-Z0-9]+\.+)+[a-zA-Z0-9]{1,3}'


pat = email + '|' + digits_str

d_pat = re.compile(web_address)

text =  '''"Lucy Gonzalez" test-defis-wtf <stagecoachmama@hotmail.com> on 11/28/2000 01:02:22 PM
http://www.living.com/shopping/item/item.jhtml?.productId=LC-JJHY-2.00-10.4S.I will send checks
 directly to the vendor for any bills pre 4/20.  I will fax you copies.  I will also try and get the payphone transferred.

www.capitolconnection.org <http://www.capitolconnection.org>.

and/or =3D=3D=3D=3D=3D=3D=3D= O\'rourke'''


print('findall:')

for x in re.findall(d_pat,text):
    print(x)


print('split:')
for x in re.split(d_pat,text):
    print(x)

2 answers:

Answer 0 (score: 2)

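From the documentation for re.findall:

If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.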

Your regex has groups, namely the parts in parentheses. If you want to display the entire match, put the regex in one big group (wrap the whole thing in parentheses) and then do print(x[0]) instead of print(x).
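The following is a minimal sketch of that advice, not the answerer's own code. It runs the question's pattern on a short made-up sentence, first as written and then wrapped in an outer group. The variable names and the sample text are assumptions chosen for illustration, and the non-capturing variant at the end is an alternative that the answer does not mention.

```python
import re

# Hypothetical sample text, shortened from the kind of input in the question.
text = "mail stagecoachmama@hotmail.com or see www.capitolconnection.org today"

# Pattern from the question: one capturing group, repeated by the trailing +.
inner = r'([a-zA-Z0-9]+\.+)+[a-zA-Z0-9]{1,3}'

# With a single group, findall returns only that group's text, and a
# repeated group keeps just its last repetition.
last_repeat = re.findall(inner, text)

# Wrapping everything in an outer group makes each result a tuple:
# x[0] is the entire match, x[1] is the inner group.
wrapped = '(' + inner + ')'
whole = [x[0] for x in re.findall(wrapped, text)]

# Assumed alternative, not from the answer: a non-capturing group (?:...)
# leaves no groups at all, so findall returns whole matches directly.
non_capturing = r'(?:[a-zA-Z0-9]+\.+)+[a-zA-Z0-9]{1,3}'
direct = re.findall(non_capturing, text)

print(last_repeat)  # ['hotmail.', 'capitolconnection.']
print(whole)        # ['hotmail.com', 'www.capitolconnection.org']
print(direct)       # ['hotmail.com', 'www.capitolconnection.org']
```

The first list reproduces the "letters followed by a dot" output from the question: the regex itself matched the full address, but findall reported only the group.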

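Answer 1 (score: 0)

My guess is that the expression has to be modified here, and that this may be the problem. For example, if we wish to match the desired pattern, we could start with an expression similar to: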

([a-zA-Z0-9]+)\.

If we wish to have 1 to 3 characters after the ., we can expand it to:

([a-zA-Z0-9]+)\.([a-zA-Z0-9]{1,3})?

Test

# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility

import re

regex = r"([a-zA-Z0-9]+)\.([a-zA-Z0-9]{1,3})?"

test_str = ("hotmail.\n"
    "living.\n"
    "item.\n"
    "2.\n"
    "4S.\n"
    "hotmail.com\n"
    "living.org\n"
    "item.co\n"
    "2.321\n"
    "4S.123")

matches = re.finditer(regex, test_str, re.MULTILINE)

for match_num, match in enumerate(matches, start=1):
    # Report the span and text of the whole match.
    print("Match {} was found at {}-{}: {}".format(
        match_num, match.start(), match.end(), match.group()))

    # Report each capture group; an unmatched optional group has position -1 and value None.
    for group_num in range(1, len(match.groups()) + 1):
        print("Group {} found at {}-{}: {}".format(
            group_num, match.start(group_num), match.end(group_num), match.group(group_num)))

# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
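
As a further sketch in the same finditer style, the question's original pattern can be left untouched and the whole match read from each match object. The sample string here is a shortened piece of the question's text and the variable names are assumptions for illustration.

```python
import re

# Pattern from the question, left unchanged (capturing group and all).
web_address = re.compile(r'([a-zA-Z0-9]+\.+)+[a-zA-Z0-9]{1,3}')

# A shortened piece of the question's text.
sample = ('<stagecoachmama@hotmail.com> on 11/28/2000\n'
          'www.capitolconnection.org <http://www.capitolconnection.org>.')

# match.group() (the same as match.group(0)) is always the entire match,
# regardless of how many capturing groups the pattern contains.
full_matches = [m.group() for m in web_address.finditer(sample)]

# match.group(1) is what re.findall reported in the question:
# the last repetition of the single capturing group.
last_groups = [m.group(1) for m in web_address.finditer(sample)]

print(full_matches)
print(last_groups)
```

Comparing the two lists shows that the pattern does match complete addresses; only the way re.findall reports groups made the results look truncated.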