在Python中使用正则表达式定位回车和换行符的问题

时间:2013-06-16 08:10:11

标签: python regex python-2.7 newline carriage-return

我有一些代码是在一个更大的程序中独立工作,但是现在在更大的程序中它似乎没有工作 - 即。它没有执行所需的操作。

问题发生在第4步(见下文),并且在反射时,我在字符类中的预期逻辑(即“除了回车符之外的所有内容”)似乎没有被正确编码(但我没有知道如何'短语'逻辑)。

我的目标是用段落标记包装每一行或每段。

Python代码

import re

# 1.  open the html file in read mode
html_file = open('test.html', 'r')

# 2.  convert to string
html_file_as_string = html_file.read()

# 3.  close the html file
html_file.close()

# 4.  replace carriage returns with closing and opening paragraph tags
html_file_as_string = re.sub('([^\r]*)\r', r'\1</p>\n<p>', html_file_as_string)

# 5.  remove time and date
html_file_as_string = re.sub(r'(Lorem ipsum \d*/\d*/\d*, \d*:\d* [a-z]{2})', r"", html_file_as_string)

# 6.  remove the white space after the opening paragraph tags
html_file_as_string = re.sub('<p>\n*\s*', r"<p>", html_file_as_string)

# 7.  remove the white space before the closing paragraph tags
html_file_as_string = re.sub('\s*</p>', r"</p>", html_file_as_string)

# 8.  open the file in write mode to clear
html_file = open('test.html', 'w')

# 9.  write the new contents to file
html_file.write(html_file_as_string)

# 10.  print to screen so we can see what is happening
print html_file_as_string

# 11.  close the html file
html_file.close()

以下是HTML文件的内容:

<p>
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
   Lorem ipsum..consectetur adipiscing elit.
   Lorem ipsum dolor sit amet, consectetur adipiscing elit. Lorem ipsum dolor sit amet, consectetur adipiscing elit.
   Lorem ipsum dolor sit amet, consectetur adipiscing elit.Lorem ipsum dolor sit amet, consectetur adipiscing elit."Lorem ipsum dolor sit amet", consectetur adipisc'ing elit.Lorem ipsum dolor...sit amet, consectetur adipiscing elit..
   Lorem ipsum dolor sit amet, consectetur adipiscing elit.Lorem ipsum dolor sit amet, consectetur adipiscing elit.
   Lorem ipsum dolor sit amet, consectetur adipiscing elit..
   .....Lorem ipsum dolor sit amet, consectetur adipiscing elit.Lorem ipsum dolor sit amet, consectetur adipiscing elit.Lorem ipsum dolor sit amet, consectetur adipiscing elit.Lorem ipsum dolor sit amet, consectetur adipiscing elit.Lorem ipsum dolor sit amet, consectetur adipiscing elit.Lorem ipsum 01/01/05, 05:00 am</p>

以下是在SciTE编辑器中查看的文件内容(因此空格,回车符和新行都可见)。

enter image description here

修改

我根据以下建议更改了正则表达式,然后将替换次数加倍两次(从步骤4中可见的原始代码更改和步骤4之前的步骤6复制)。

工作代码:

import re

# 1.  open the html file in read mode
html_file = open('test.html', 'r')

# 2.  convert to string
html_file_as_string = html_file.read()

# 3.  close the html file
html_file.close()

# 6(added).  remove the white space after the opening paragraph tags
html_file_as_string = re.sub('<p>\n*\s*', r"<p>", html_file_as_string)

# 4(changed).  replace carriage returns with closing and opening paragraph tags
html_file_as_string = re.sub('([^\r\n]*)(\r\n?|\n)', r'\1</p>\2<p>', html_file_as_string)

# 5.  remove time and date
html_file_as_string = re.sub(r'(Lorem ipsum \d*/\d*/\d*, \d*:\d* [a-z]{2})', r"", html_file_as_string)

# 6.  remove the white space after the opening paragraph tags
html_file_as_string = re.sub('<p>\n*\s*', r"<p>", html_file_as_string)

# 7.  remove the white space before the closing paragraph tags
html_file_as_string = re.sub('\s*</p>', r"</p>", html_file_as_string)

# 8.  open the file in write mode to clear
html_file = open('test.html', 'w')

# 9.  write the new contents to file
html_file.write(html_file_as_string)

# 10.  print to screen so we can see what is happening
print html_file_as_string

# 11.  close the html file
html_file.close()

编辑2:

上面的代码在代码的其他部分过于激进,并做了太多修改,回到绘图板。

1 个答案:

答案 0 :(得分:1)

我认为你最好迭代字符串,而不是尝试使用正则表达式作为Python,你可以通过几个步骤来完成这个任务

import re
parsed_html = []

# Open the file and close it after being read
with open('test.html', 'r') as html_file:
    lines = html_file.readlines()

# Iterate through each line using \r\n, \r or \n as separators 
for line in lines:
    # Remove whitespace chars before and after the content (6 & 7)
    line = line.strip()

    # Skip empty lines (you could merge this and latter if)
    if not line:
         continue

    # Skip lines with only <p> or </p> (check output and remove if not needed)
    if re.match("^</?p>$", line):
         continue

    # Remove the date line (using + instead of * as * matches 0 or more entries)
    line = re.sub(r'Lorem ipsum \d+/\d+/\d+, \d+:\d+ (am|pm)', '', line)

    # Make sure we add p tag and CL/LR to the end of each line
    line = "<p>{0}</p>\r\n".format(line)

    # Append current line to a list that we will use to write the file
    parsed_html.append(line)

# Write contents of the parsed html to the file
with open('test.html', 'w') as html_file:
    html_file.writelines(parsed_html)