Python web scraping: what do the regex symbols mean?

Time: 2016-08-10 14:50:49

Tags: python web-scraping

In the code below, what does each element of the pattern string in re.sub('<[^>]*>|[\n]|\[[0-9]*\]', '', htmlread) mean?

import urllib2
import re

htmltext = urllib2.urlopen("https://en.wikipedia.org/wiki/Linkin_Park")
htmlread = htmltext.read()
htmlread = re.sub('<[^>]*>|[\n]|\[[0-9]*\]', '', htmlread)
regex = '(?<=Linkin Park was founded)(.*)(?=the following year.)'
pattern = re.compile(regex)
htmlread = re.findall(pattern, htmlread)
print "Linkin Park was founded" + htmlread[0] + "the following year."

2 answers:

Answer 0 (score: 0)

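The line htmlread = re.sub('<[^>]*>|[\n]|\[[0-9]*\]', '', htmlread) removes

  • anything between < and > (that is, HTML tags), OR
  • newlines, OR
  • numbers between square brackets, or empty square brackets (such as [12] or [])

from htmlread.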
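
To make the three alternatives easier to read, here is a minimal sketch that writes the same pattern in verbose mode, with one comment per piece. The sample string and the names PATTERN, sample and cleaned are made up for illustration and are not part of the question's code; the sketch is written for Python 3.

```python
import re

# The question's pattern, split across lines with re.VERBOSE so that each
# alternative can carry its own comment. The '|' means "OR".
PATTERN = re.compile(r"""
    <[^>]*>       # a '<', any run of characters that are not '>', then '>' (an HTML tag)
  | [\n]          # a single newline character
  | \[[0-9]*\]    # a '[', zero or more digits, then ']' (for example [12] or [])
""", re.VERBOSE)

# Hypothetical input, standing in for the downloaded page.
sample = "<p>Linkin Park is a <b>band</b>.[1][]\nFormed in 1996.</p>"

# Every match is replaced by the empty string, i.e. deleted.
cleaned = PATTERN.sub("", sample)
print(cleaned)
```

Because the newline is deleted rather than replaced by a space, the two sentences in the sample end up joined without a gap.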

An interesting wiki post: Reference - What does this regex mean?

Answer 1 (score: 0)

It replaces every match of the pattern with '' (the empty string), which means the match is removed from the htmlread variable.
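
As a rough illustration of "replace with ''", the sketch below first lists what the pattern matches and then substitutes those matches. The sample text and the variable names are assumptions made for this example, and the code is written for Python 3.

```python
import re

pattern = r"<[^>]*>|[\n]|\[[0-9]*\]"

# Made-up sample text, not taken from the actual Wikipedia page.
text = "<i>founded</i> in 1996[2]\n"

# findall shows the pieces that re.sub is going to replace.
matches = re.findall(pattern, text)
print(matches)

# Replacing with '' deletes each match.
removed = re.sub(pattern, "", text)
print(removed)

# Replacing with a visible marker instead shows where the matches were.
marked = re.sub(pattern, "#", text)
print(marked)
```

Swapping '' for a marker such as '#' is only a debugging aid here; it makes the positions of the deleted pieces visible.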

Please read more about RegEx.