Question

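I have an RSS feed parser, and I'm using regexes to clean up the tags. I'm having trouble cleaning all the ' chars with reg4, and I'd like to know what I can use for reg4: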

import re
import feedparser

reg1 = re.compile(r'<br />') #Regex to replace <br /> with \n (see reg1.sub)
reg2 = re.compile(r'(<!--.*?-->|<[^>]*>)') #Regex to clean all html tags (anything with <something>)
reg3 = re.compile(r'&nbsp') #Regex to clean all &nbsp 
reg4 = re.compile(r'') #Regex to clean all ' chars (this is causing me issues for some reason)

def parseFeeds( str ):
 d = feedparser.parse(str)
 print "There are", len(d['items']), "items in", str
 FILE_INPUT = open("outputNewsFeed.txt","w")
 for item in d['items']:
  first_filter = reg1.sub('\n', item.description)
  second_filter = reg2.sub('', first_filter)
  third_filter = reg3.sub(' ', second_filter)
  item_description = reg4.sub('', third_filter)
  try:
   FILE_INPUT.write(item_description)
  except IOError:
   print "Error: can\'t find file or read data"
 FILE_INPUT.close()

Here is my current sample output:

There are 25 items in http://www.reddit.com/r/python/.rss

[link] [12 comments]submitted by  rasbt  
[link] [comment]submitted by  iamsidd2k7  
[link] [comment]submitted by  josephturnip2  
[link] [28 comments]submitted by  Maslo59  
[link] [1 comment]The Source code isn't wonderful (I'm only a hobbyist, no were near a pro) but I use this whenever I'm at my desktop, and need to make some kind of decision or choose between two things, its sort of based off my unsure nature, lol.

Answer 1

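If you only need to remove single quotes, you can escape the quote like this (in a raw string the backslash stays in the pattern, and the regex engine reads \' as a literal quote):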

reg4 = re.compile(r'\'')

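Alternatively, if you don't mind changing how the string is written, you can delimit it with double quotes so no escaping is needed: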

reg4 = re.compile(r"'")
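As a quick check, here is a minimal sketch showing that both spellings behave the same. The sample string is borrowed from the output in the question, and the `strip_quotes` helper and the variable names are made up for illustration. Since the pattern is a single fixed character, plain `str.replace` should also do the job without a regex.

```python
import re

# Both spellings compile to a pattern that matches a literal single quote.
reg4_escaped = re.compile(r'\'')  # pattern text is \' (backslash + quote)
reg4_double = re.compile(r"'")    # pattern text is just '


def strip_quotes(text):
    """Hypothetical helper: remove every ' character from text."""
    return reg4_double.sub('', text)


sample = "The Source code isn't wonderful (I'm only a hobbyist)"

cleaned = strip_quotes(sample)
same_via_escaped = reg4_escaped.sub('', sample)
same_via_replace = sample.replace("'", "")  # no regex needed for one fixed char
```

In the question's loop, either compiled pattern could be dropped in as reg4 without changing the reg4.sub('', third_filter) call.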
