Question

In the code below, what does each element of the pattern string in re.sub('<[^>]*>|[\n]|\[[0-9]*\]', '', htmlread) mean?

import urllib2
import re

htmltext = urllib2.urlopen("https://en.wikipedia.org/wiki/Linkin_Park")
htmlread = htmltext.read()
htmlread = re.sub('<[^>]*>|[\n]|\[[0-9]*\]', '', htmlread)
regex = '(?<=Linkin Park was founded)(.*)(?=the following year.)'
pattern = re.compile(regex)
htmlread = re.findall(pattern, htmlread)
print "Linkin Park was founded" + htmlread[0] + "the following year."

Answer 1

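The line htmlread = re.sub('<[^>]*>|[\n]|\[[0-9]*\]', '', htmlread) removes each of the following from htmlread:

anything enclosed in <> (that is, an HTML tag), OR
a newline, OR
digits between square brackets, or an empty pair of brackets (for example [12] or [])

The | characters separate the three alternatives, so a match on any one of them is removed.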
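To see what each alternative does on its own, here is a minimal sketch. The sample string is made up for illustration and stands in for the downloaded page. I also wrote the patterns as raw strings (r'...'), which the original code does not do; the regex itself is unchanged.

```python
import re

# Hypothetical sample text standing in for the downloaded page.
sample = '<p>Linkin Park[1] is a band[].</p>\n<b>Rock</b>'

# Each alternative of the pattern, applied on its own.
no_tags = re.sub(r'<[^>]*>', '', sample)      # '<', any run of characters that are not '>', then '>'
no_newlines = re.sub(r'[\n]', '', sample)     # a single newline character
no_refs = re.sub(r'\[[0-9]*\]', '', sample)   # '[', zero or more digits, then ']'

# The full pattern: '|' means "match any one of these alternatives".
cleaned = re.sub(r'<[^>]*>|[\n]|\[[0-9]*\]', '', sample)

print(no_refs)
print(cleaned)
```

Note that [0-9]* allows zero digits, which is why the empty [] is removed as well.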

A useful wiki post: Reference - What does this regex mean?

Answer 2

It replaces every match of the pattern with '' (an empty string), which means the matched text is removed from the htmlread variable.
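A small sketch of why an empty replacement amounts to deletion. The sample text and the '(ref)' marker are made up for illustration:

```python
import re

text = 'founded in 1996[1]'  # hypothetical sample

# Empty replacement: the matched text simply disappears.
deleted = re.sub(r'\[[0-9]*\]', '', text)

# Non-empty replacement: the matched text is swapped for the marker.
marked = re.sub(r'\[[0-9]*\]', '(ref)', text)

print(deleted)
print(marked)
```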

It is worth reading more about regular expressions.

Python web scraping: what the symbols mean

2 answers: