Question

I am a beginner with Python regular expressions.

The target test.php code:

<html>
  <head></head> 
  <body>
    <a href="www.google.com">josn2051@yahoo.com.tw</a>
    <div>john@yahoo.com.tw</div>
    testtest321@gmail.com
    chorm3636@test.test.test.com
  </body>
</html>

Here is my code:

import requests,re

email_pattern = re.compile('([\w\-\.]+@(\w[\w\-]+\.)+[\w\-]+)')

res = requests.get("http://127.0.0.1/test.php")

a = email_pattern.findall(res.text)

print a

Result:

[(u'josn2051@yahoo.com.tw', u'com.'), (u'john@yahoo.com.tw', u'com.'), (u'asdfFGw@gmail.com', u'gmail.'), (u'chorm3636@test.test.test.com', u'test.')]

But I want the result to look like this:

[josn2051@yahoo.com.cn,john@yahoo.com.us,testtest321@gmail.com,chorm3636@test.test.test.com]

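What is wrong with my pattern or code?

Why is the result a list of tuples that each contain an extra com., gmail., or test.?

Thanks for clearing up my confusion!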

Answer 1

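First rule: never use regular expressions to parse HTML. It cannot be done!

Second, once you have a piece of plain text in which you want to find and validate email addresses, you can google it and find 2-5 very good regular expressions on StackOverflow. Regular expressions are not specific to Python.

Third, look for a better job. Trying to scrape email addresses off websites is no easy task, and everyone here hates the people who send us spam.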
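To illustrate the first point, here is a minimal sketch (assuming Python 3) that lets the standard library's html.parser strip the markup and runs a regular expression only over the remaining plain text. The class name TextCollector, the variable names, and the simplified pattern are illustrative assumptions, not part of the original question, and the page is inlined as a string rather than fetched over HTTP.

```python
import re
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects only the text nodes of a document, ignoring tags and attributes."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called by the parser for every run of text between tags.
        self.chunks.append(data)


# Stand-in for the body of test.php (hypothetical inline copy).
page = """<html>
  <head></head>
  <body>
    <a href="www.google.com">josn2051@yahoo.com.tw</a>
    <div>john@yahoo.com.tw</div>
    testtest321@gmail.com
    chorm3636@test.test.test.com
  </body>
</html>"""

collector = TextCollector()
collector.feed(page)
collector.close()
text = " ".join(collector.chunks)

# A deliberately simple pattern for illustration; not a full address validator.
simple_email = re.compile(r'[\w.-]+@(?:[\w-]+\.)+[\w-]+')
found = simple_email.findall(text)
print(found)
```

Because the href attribute never reaches the pattern, anything that only appears inside tags or attributes is ignored by construction.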

Answer 2

Make the inner group non-capturing:

([\w\-\.]+@(?:\w[\w\-]+\.)+[\w\-]+)
            ^^
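The reason this helps: when a pattern contains more than one capturing group, re.findall returns a list of tuples with one element per group, and a repeated group keeps only its last repetition (hence the extra com., gmail., and test.). With a single group left, findall returns plain strings. A minimal sketch of the difference, assuming Python 3 and with the page text inlined instead of fetched with requests; the variable names are illustrative:

```python
import re

# Sample text standing in for res.text (inlined so the sketch runs offline).
html = """
<a href="www.google.com">josn2051@yahoo.com.tw</a>
<div>john@yahoo.com.tw</div>
testtest321@gmail.com
chorm3636@test.test.test.com
"""

# Original pattern: two capturing groups, so findall yields 2-tuples.
capturing = re.compile(r'([\w\-\.]+@(\w[\w\-]+\.)+[\w\-]+)')

# Inner group made non-capturing with (?:...): only one group remains,
# so findall yields plain strings.
non_capturing = re.compile(r'([\w\-\.]+@(?:\w[\w\-]+\.)+[\w\-]+)')

with_tuples = capturing.findall(html)
emails = non_capturing.findall(html)

print(with_tuples[0])  # ('josn2051@yahoo.com.tw', 'com.')
print(emails)
```

An alternative that leaves the pattern untouched is to keep only the first element of each tuple, for example [m[0] for m in with_tuples].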
