Question

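I'm trying to do the Python Challenge: http://www.pythonchallenge.com/pc/def/ocr.html. OK, I know I could copy-paste the code from the page source into a txt file and work from that, but I want to fetch it from the web to improve myself. (Plus, I've already done it that way.) I tried

re.findall(r"<!--(.*?)-->",html)

but it doesn't get anything. If you want my full code, here it is: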

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import requests,re
link = "http://www.pythonchallenge.com/pc/def/ocr.html"
x = requests.get(link)
codes = re.findall(r"<!--(.*)-->",str(x.content))
print codes

I also tried this:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import requests,re
link = "http://www.pythonchallenge.com/pc/def/ocr.html"
x = requests.get(link)
codes = re.findall("<!--\n(.*)\n-->",str(x.content))
print codes

Now it finds the text, but I still can't get that mess :(

Answer 1

I would use an HTML parser instead. You can find comments in HTML with BeautifulSoup.

Working code:

import requests
from bs4 import BeautifulSoup, Comment


link = "http://www.pythonchallenge.com/pc/def/ocr.html"
response = requests.get(link)

soup = BeautifulSoup(response.content, "html.parser")

code = soup.find_all(text=lambda text: isinstance(text, Comment))[-1]
print(code.strip())
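
If installing BeautifulSoup is not an option, roughly the same idea can be sketched with the standard library's html.parser. This is only an illustration, not part of the answer above: the CommentCollector class, the extract_comments helper, and the SAMPLE_HTML string (a small stand-in for the challenge page) are all made up for the example.

```python
from html.parser import HTMLParser


class CommentCollector(HTMLParser):
    """Collects the body of every HTML comment the parser encounters."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        # data is the text between <!-- and -->, newlines included
        self.comments.append(data)


# Hypothetical stand-in for the page source (assumption, not the real page)
SAMPLE_HTML = """<html>
<body>
<!--
find rare characters in the mess below:
-->
<!--
%%$@_$^__#)^)&!_+]!*@&^}@[@%]()%+$&[(_@%+%$*^@$^!+]!&_#)_*}{}}!}_]$[%}@[{_@#_^{*
-->
</body>
</html>"""


def extract_comments(html):
    """Return the bodies of all comments in html, in document order."""
    parser = CommentCollector()
    parser.feed(html)
    parser.close()
    return parser.comments


comments = extract_comments(SAMPLE_HTML)
# Like the BeautifulSoup version, take the last comment on the page
print(comments[-1].strip())
```

With a real response you would pass the decoded page text to extract_comments instead of SAMPLE_HTML.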

Answer 2

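Not sure what you mean by "that mess". You should include all the details of the challenge in this post rather than linking users to the pythonchallenge page.

Either way, if you set the regular expression to single-line mode (/s), the dot character . also matches the newline \n. That lets you drop the \n(.+)\n construct from your regex, which may solve your problem.

Here is a link to a working regex example.

Here is the modified Python 2.7 code: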

#!/usr/bin/python
import requests, re
link = "http://www.pythonchallenge.com/pc/def/ocr.html"
x = requests.get(link)
codes = re.findall("<!--(.*?)-->", str(x.content), re.S)
print codes[1]

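Note the re.S, (.*?) and codes[1] modifications.

re.S is the /s flag (single-line mode, also spelled re.DOTALL)
(.*?) makes the * quantifier non-greedy
codes[1] prints the second group of content found in HTML comments (because findall(..) finds 2 matches and returns a list of both groups).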
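
To see what each of those changes does in isolation, here is a small offline sketch. SAMPLE_HTML is an invented string with two multi-line comments, not the real page, so the exact contents are an assumption for illustration only.

```python
import re

# Hypothetical HTML with two multi-line comments (not the real challenge page)
SAMPLE_HTML = (
    "<p>intro</p>\n"
    "<!--\nfirst note\n-->\n"
    "<p>body</p>\n"
    "<!--\nsecond\nblock\n-->\n"
)

# Without re.S, "." stops at newlines, so multi-line comments never match.
no_flag = re.findall(r"<!--(.*?)-->", SAMPLE_HTML)

# With re.S, "." also matches "\n"; the lazy "*?" stops at the first "-->".
lazy = re.findall(r"<!--(.*?)-->", SAMPLE_HTML, re.S)

# Greedy ".*" with re.S runs on to the last "-->" and merges both comments.
greedy = re.findall(r"<!--(.*)-->", SAMPLE_HTML, re.S)

print(no_flag)               # no matches at all
print(len(lazy), lazy[1])    # two matches; index 1 is the second comment
print(len(greedy))           # a single, over-long match
```

The non-greedy quantifier matters only once re.S is set: without the flag there is nothing to match, and with the flag but a greedy .* the two comments are swallowed into one result.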

Answer 3

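You can solve it with:

codes = re.findall(r"<!--(.*?)-->", str(x.content), re.S)

The s flag (written /s in other regex flavors, re.S in Python) makes . match whitespace and line breaks too.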
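
One side note on str(x.content), which all the snippets in this thread use: they are written for Python 2. Assuming you run them under Python 3 instead, str() on a bytes object returns its repr, in which every newline becomes the two characters backslash and n, so patterns containing \n stop matching. The sketch below shows the difference on a made-up bytes value (raw is a stand-in for x.content); with requests, x.text gives you the already decoded string.

```python
import re

# Hypothetical bytes payload standing in for x.content (assumption)
raw = b"<html>\n<!--\nabc\n-->\n</html>"

as_repr = str(raw)             # Python 3: the repr, "b'<html>\\n<!--...'"
as_text = raw.decode("utf-8")  # real text with real newlines

pattern = r"<!--\n(.*)\n-->"

from_repr = re.findall(pattern, as_repr)  # no real newlines left to match
from_text = re.findall(pattern, as_text)  # matches the comment body

print(from_repr)
print(from_text)
print(re.findall(r"<!--(.*?)-->", as_text, re.S))
```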

How to use re in Python
