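I've long wondered why re.findall does such weird things as returning empty strings and tuples (what is that supposed to mean?). It also doesn't seem to treat parentheses () in the usual way, and it reads | differently from what I expect: ab|cd is (ab)|(cd), not a(b|c)d as you would normally think. Because of this I can't write the regex I need.
But in this example I see clearly wrong behaviour on a simple pattern: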
([a-zA-Z0-9]+\.+)+[a-zA-Z0-9]{1,3}
It describes simple URLs (such as gskinner.com or www.capitolconnection.org), as you can check in the regex helper at https://regexr.com/. With re.findall, however, I get:
hotmail.
living.
item.
2.
4S.
That is, letters followed by a dot. How is this possible?
Full code, in which I try to filter junk out of a text:
import re
singles = r'[()\.\/$%=0-9,?!=; \t\n\r\f\v\":\[\]><]'
digits_str = singles + r'[()\-\.\/$%=0-9 \t\n\r\f\v\'\":\[\]]*'
#small_word = '[a-zA-Z0-9]{1,3}'
#junk_then_small_word = singles + small_word + '(' + singles + small_word + ')*'
email = singles + r'\S+@\S*'
http_str = r'[^\.]+\.+[^\.]+\.+([^\.]+\.+)+?'
http = '(http|https|www)' + http_str
# raw strings keep the backslashes intact without invalid-escape warnings
web_address = r'([a-zA-Z0-9]+\.+)+[a-zA-Z0-9]{1,3}'
pat = email + '|' + digits_str
d_pat = re.compile(web_address)
text = '''"Lucy Gonzalez" test-defis-wtf <stagecoachmama@hotmail.com> on 11/28/2000 01:02:22 PM
http://www.living.com/shopping/item/item.jhtml?.productId=LC-JJHY-2.00-10.4S.I will send checks
directly to the vendor for any bills pre 4/20. I will fax you copies. I will also try and get the payphone transferred.
www.capitolconnection.org <http://www.capitolconnection.org>.
and/or =3D=3D=3D=3D=3D=3D=3D= O\'rourke'''
print('findall:')
for x in re.findall(d_pat, text):
    print(x)
print('split:')
for x in re.split(d_pat, text):
    print(x)
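Answer 0 (score: 2)
From the documentation for re.findall:
If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.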
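To illustrate the rule quoted above, here is a minimal sketch of how the shape of the re.findall result changes with the number of groups. The sample text and the three patterns are made up for illustration and are not taken from the question.

```python
import re

text = "ab12 cd34"  # illustrative sample, not from the question

# No groups: each item is the whole match.
no_groups = re.findall(r"[a-z]+\d+", text)

# One group: each item is only the text captured by that group.
one_group = re.findall(r"([a-z]+)\d+", text)

# Two groups: each item is a tuple with one entry per group.
two_groups = re.findall(r"([a-z]+)(\d+)", text)

print(no_groups)   # ['ab12', 'cd34']
print(one_group)   # ['ab', 'cd']
print(two_groups)  # [('ab', '12'), ('cd', '34')]
```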
Your regex contains a group, namely the part in parentheses. If you want the entire match to be shown, wrap the regex in one big group (put parentheses around the whole thing) and then do print(x[0]) instead of print(x).
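As a rough sketch of this advice applied to the pattern from the question: the first variant shows the reported behaviour (a repeated group keeps only its last repetition), the second adds the outer group, and the third uses a non-capturing group (?:...), which is an alternative not mentioned in the answer. The shortened sample text is an assumption made for brevity.

```python
import re

# Shortened sample taken from pieces of the question's text.
text = "<stagecoachmama@hotmail.com> www.capitolconnection.org"

# Original pattern: findall returns the single group, and a repeated
# group only remembers its last repetition.
original = r"([a-zA-Z0-9]+\.+)+[a-zA-Z0-9]{1,3}"
# Outer group added: each item is (whole match, last inner repetition).
wrapped = r"(([a-zA-Z0-9]+\.+)+[a-zA-Z0-9]{1,3})"
# Non-capturing inner group: no groups remain, so items are whole matches.
non_capturing = r"(?:[a-zA-Z0-9]+\.+)+[a-zA-Z0-9]{1,3}"

from_original = re.findall(original, text)
from_wrapped = [x[0] for x in re.findall(wrapped, text)]
from_non_capturing = re.findall(non_capturing, text)

print(from_original)       # ['hotmail.', 'capitolconnection.']
print(from_wrapped)        # ['hotmail.com', 'www.capitolconnection.org']
print(from_non_capturing)  # ['hotmail.com', 'www.capitolconnection.org']
```

The non-capturing form avoids the tuple indexing, at the cost of losing access to the inner group.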
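Answer 1 (score: 0)
My guess is that the expression itself needs to be modified, and that this may be the problem. If we want to match the desired pattern, we could start with an expression like: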
([a-zA-Z0-9]+)\.
If we want 1 to 3 characters after the ., we can extend it to:
([a-zA-Z0-9]+)\.([a-zA-Z0-9]{1,3})?
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility
import re
regex = r"([a-zA-Z0-9]+)\.([a-zA-Z0-9]{1,3})?"
test_str = ("hotmail.\n"
"living.\n"
"item.\n"
"2.\n"
"4S.\n"
"hotmail.com\n"
"living.org\n"
"item.co\n"
"2.321\n"
"4S.123")
matches = re.finditer(regex, test_str, re.MULTILINE)
for matchNum, match in enumerate(matches, start=1):
    # match.group() with no argument is the whole match
    print("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum=matchNum, start=match.start(), end=match.end(), match=match.group()))
    # groups are numbered from 1
    for groupNum in range(1, len(match.groups()) + 1):
        print("Group {groupNum} found at {start}-{end}: {group}".format(groupNum=groupNum, start=match.start(groupNum), end=match.end(groupNum), group=match.group(groupNum)))
# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
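As a side note, re.finditer as used above yields match objects, so the whole match is always available through group(0) regardless of how many groups the pattern has. A minimal sketch with the unmodified pattern from the question follows; the sample sentence is an assumption built from the two URLs the question mentions.

```python
import re

# Unmodified pattern from the question.
pattern = re.compile(r"([a-zA-Z0-9]+\.+)+[a-zA-Z0-9]{1,3}")
# Illustrative sentence built from the URLs named in the question.
text = "see www.capitolconnection.org and gskinner.com"

# group(0) is the whole match; group(1) is the last repetition of the group.
whole_matches = [m.group(0) for m in pattern.finditer(text)]
last_groups = [m.group(1) for m in pattern.finditer(text)]

print(whole_matches)  # ['www.capitolconnection.org', 'gskinner.com']
print(last_groups)    # ['capitolconnection.', 'gskinner.']
```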