I have an RSS feed parser, and I am using regexes to clean up the tags. I'm having trouble getting reg4 to strip all the ' characters. What should reg4 be?
import re
import feedparser

reg1 = re.compile(r'<br />') #Regex to replace <br /> with \n (see reg1.sub)
reg2 = re.compile(r'(<!--.*?-->|<[^>]*>)') #Regex to clean all html tags (anything with <something>)
reg3 = re.compile(r'&nbsp;') #Regex to clean all &nbsp; entities
reg4 = re.compile(r'') #Regex to clean all ' chars (this is causing me issues for some reason)
def parseFeeds( str ):
    d = feedparser.parse(str)
    print "There are", len(d['items']), "items in", str
    FILE_INPUT = open("outputNewsFeed.txt","w")
    for item in d['items']:
        # Apply the filters in order: line breaks, tags, spaces, quotes
        first_filter = reg1.sub('\n', item.description)
        second_filter = reg2.sub('', first_filter)
        third_filter = reg3.sub(' ', second_filter)
        item_description = reg4.sub('', third_filter)
        try:
            FILE_INPUT.write(item_description)
        except IOError:
            print "Error: can't find file or read data"
    # close must be called; a bare FILE_INPUT.close does nothing
    FILE_INPUT.close()
Here is my current sample output:
There are 25 items in http://www.reddit.com/r/python/.rss
[link] [12 comments]submitted by rasbt
[link] [comment]submitted by iamsidd2k7
[link] [comment]submitted by josephturnip2
[link] [28 comments]submitted by Maslo59
[link] [1 comment]The Source code isn't wonderful (I'm only a hobbyist, no were near a pro) but I use this whenever I'm at my desktop, and need to make some kind of decision or choose between two things, its sort of based off my unsure nature, lol.
Answer 0 (score: 1)
If you only need to remove single quotes, you can escape the quote like this:
reg4 = re.compile(r'\'')
Alternatively, if you don't mind changing how the string is written, you can use double quotes as the string delimiter:
reg4 = re.compile(r"'")
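As a quick check, the two spellings above should behave identically. Here is a minimal sketch in Python 3 syntax; the sample string is made up for illustration (loosely based on the output in the question), and the variable names `escaped` and `double_quoted` are hypothetical, not part of the original code.

```python
import re

# Both spellings compile to a pattern that matches one literal single quote.
# r'\'' is a raw string holding a backslash and a quote; the regex \' matches '.
escaped = re.compile(r'\'')
# With double quotes as the delimiter, no escaping is needed.
double_quoted = re.compile(r"'")

# Hypothetical sample text, loosely based on the output shown in the question.
sample = "The Source code isn't wonderful (I'm only a hobbyist)"

cleaned_escaped = escaped.sub('', sample)
cleaned_double = double_quoted.sub('', sample)

print(cleaned_escaped)
print(cleaned_double == cleaned_escaped)
# For a single fixed character, plain str.replace gives the same result.
print(sample.replace("'", "") == cleaned_escaped)
```

Since the pattern is a single fixed character, `str.replace` is probably the simpler choice here; a compiled regex mainly pays off if the pattern later grows to cover more than one character.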