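Any help on why this regex isn't matching <td>\n etc.? I tested it successfully on pythex.org. Basically I'm just trying to clean up the output so that it only says myfile.doc. I also tried (<td>)?\\n\s+(</td>)?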
>>> from bs4 import BeautifulSoup
>>> from pprint import pprint
>>> import re
>>> soup = BeautifulSoup(open("/home/user/message_tracking.html"), "html.parser")
>>>
>>> filename = str(soup.findAll("td", text=re.compile(r"\.[a-z]{3,}")))
>>> print filename
[<td>\n myfile.doc\n </td>]
>>> duh = re.sub("(<td>)?\n\s+(</td>)?", '', filename)
>>> print duh
[<td>\n myfile.doc\n </td>]
Answer 0 (score: 3)
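It's hard to tell without seeing repr(filename), but I think your problem is a mix-up between real newline characters and escaped newlines (a literal backslash followed by n).
Compare and contrast the following examples: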
>>> pattern = "(<td>)?\n\s+(</td>)?"
>>> filename1 = '[<td>\n myfile.doc\n </td>]'
>>> filename2 = r'[<td>\n myfile.doc\n </td>]'
>>>
>>> re.sub(pattern, '', filename1)
'[myfile.doc]'
>>> re.sub(pattern, '', filename2)
'[<td>\\n myfile.doc\\n </td>]'
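If repr(filename) does show a literal backslash followed by n, one way to handle it is to match that two-character sequence explicitly with a raw-string pattern. The following is a minimal sketch under that assumption; the names clean, newline_pattern and literal_pattern are illustrative and do not come from the question.

```python
import re

# One character: a real newline.
real = '[<td>\n myfile.doc\n </td>]'
# Two characters: a literal backslash followed by "n".
escaped = r'[<td>\n myfile.doc\n </td>]'

# In a raw string, \n reaches the regex engine as an escape and matches a real newline.
newline_pattern = re.compile(r"(<td>)?\n\s+(</td>)?")
# In a raw string, \\n matches a literal backslash followed by "n".
literal_pattern = re.compile(r"(<td>)?\\n\s+(</td>)?")


def clean(text):
    """Remove the <td> wrapper and padding for either kind of newline."""
    text = newline_pattern.sub('', text)
    return literal_pattern.sub('', text)


print(clean(real))
print(clean(escaped))
```

Note that "(<td>)?\\n\s+(</td>)?" written without the r prefix is not the same pattern: Python collapses \\n to \n before the regex engine sees it, so it still looks for a real newline.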
Answer 1 (score: 0)
If your goal is just to get the whitespace-stripped string out of the <td> tag, you can let BeautifulSoup do it for you through the tag's stripped_strings attribute:
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(open("/home/user/message_tracking.html"), "html.parser")
# Find the first <td> in the HTML whose text looks like a filename with an extension
filename_tag = soup.find("td", text=re.compile(r"\.[a-z]{3,}"))
# stripped_strings is a generator, so join its pieces into a single string
filename_string = " ".join(filename_tag.stripped_strings)
print(filename_string)
If you want to extract more strings from tags of the same type, you can use findNext to get the next td tag after the current one:
# Find the next matching <td> after the current one; findNext is called on the current tag, not on soup
filename_tag = filename_tag.findNext("td", text=re.compile(r"\.[a-z]{3,}"))
filename_string = " ".join(filename_tag.stripped_strings)
print(filename_string)
Then loop, repeating this until findNext returns None.
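As an alternative to stepping through the tags by hand, the loop can be sketched with find_all, which returns every matching tag at once. This is not the answerer's code: the inline HTML is a hypothetical stand-in for message_tracking.html, whose contents are not shown, and string= is used here as the newer spelling of the text= argument.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup; the real file's structure is an assumption.
html = """
<table>
  <tr><td>
    myfile.doc
  </td><td>Alice</td></tr>
  <tr><td>
    report.pdf
  </td><td>Bob</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
extension = re.compile(r"\.[a-z]{3,}")

# Collect the text of every <td> that looks like a filename,
# with surrounding whitespace removed by get_text(strip=True).
filenames = [td.get_text(strip=True)
             for td in soup.find_all("td", string=extension)]
print(filenames)
```

Working on the tags directly avoids calling str() on the result list, so there is no escaped markup left to clean up with a regex afterwards.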