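I'm not familiar with regex, and I need to extract everything between <td> NEED HERE </td>.
However, I can only match the <td> tag when it has CSS attributes.
I need to skip the <table>, <tr>, and <td> tags themselves,
with or without attributes: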
<td[^>]*>
Example:
<table height="100%" border="0" cellpadding="0" cellspacing="0">
<tr><td width="4" class="cll">Hello</td>
<td class="tex" nowrap>Output Status</td><td width="4" class="clr">100%</td></tr></table>
Desired output:
Hello, Output Status, 100%
In some cases &nbsp; will appear between these tags, and I want to skip those as well.
Answer 0 (score: 2)
You'll want to use an HTML parser like BeautifulSoup. You mentioned that your backend is Python. If you don't already have BeautifulSoup, install it with pip:
pip install beautifulsoup4
This should give you exactly what you are looking for:
from bs4 import BeautifulSoup
html_doc = """
<p class="story">...</p>
<table height="100%" border="0" cellpadding="0" cellspacing="0">
<tr><td width="4" class="cll">Hello</td>
<td class="tex" nowrap>Output Status</td><td width="4" class="clr">100%</td></tr></table>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
td_list = soup.find_all('td')
td_list_text = []
for td in td_list:
    # Collect the text content of each <td>, ignoring its attributes
    td_list_text.append(td.text)
my_string = ", ".join(td_list_text)
print(my_string)
Output:
Hello, Output Status, 100%
You can read more about the options available here: https://www.crummy.com/software/BeautifulSoup/
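If installing a third-party package is not an option, the same idea can be sketched with the standard library's html.parser module. This is only a minimal sketch, assuming the <td> cells are not nested inside each other. The names TdTextCollector and extract_td_text, and the extra &nbsp;-only cell in the sample, are illustrative and not part of the original question or answer.

```python
from html.parser import HTMLParser


class TdTextCollector(HTMLParser):
    """Collects the text inside each <td>, skipping cells that are empty."""

    def __init__(self):
        # convert_charrefs defaults to True, so &nbsp; arrives as "\xa0"
        super().__init__()
        self.cells = []
        self._in_td = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # Attributes (width, class, nowrap, ...) are simply ignored
        if tag == "td":
            self._in_td = True
            self._parts = []

    def handle_data(self, data):
        if self._in_td:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._in_td:
            self._in_td = False
            # Turn non-breaking spaces into plain spaces, then trim
            text = "".join(self._parts).replace("\xa0", " ").strip()
            if text:  # skip cells that held only &nbsp; or whitespace
                self.cells.append(text)


def extract_td_text(html):
    parser = TdTextCollector()
    parser.feed(html)
    parser.close()
    return parser.cells


html_doc = """
<table height="100%" border="0" cellpadding="0" cellspacing="0">
<tr><td width="4" class="cll">Hello</td><td>&nbsp;</td>
<td class="tex" nowrap>Output Status</td><td width="4" class="clr">100%</td></tr></table>
"""

print(", ".join(extract_td_text(html_doc)))
```

Because the parser hands over tag names and attributes separately, it makes no difference whether a <td> carries attributes or not.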
Answer 1 (score: 2)
Using regexes on HTML is inherently error-prone, and many well-intentioned people will tell you to never do it ever. I generally recommend using an HTML parser like in sniperd's answer.
But for simple data extraction (e.g. no tag nesting) regexes are sometimes just fine:
import re
extract_td_regex = re.compile(r"<td[\w\"'=\s]*>([^><]+)<\/td")
Let's break that down:
"<td" # start td tag
"[\w\"'=\s]*" # match any word character, white space, =, ', " zero or more times
">" # close opening td tag
"([^><]+)" # capture group gets anything *not* > or <, one or more times
"<\/td" # closing td tag
The capture group will contain the inner td contents.
Note that this will fail if you have tags (like <span>s) inside the <td>s.
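To see the pattern in action, here is a small sketch that applies it to the sample table from the question with findall and drops cells that contain only &nbsp;. The pattern itself is unchanged. The helper name extract_cells and the extra &nbsp;-only cell in the sample are assumptions added for illustration.

```python
import re

# Same pattern as above: the capture group holds the inner <td> contents
extract_td_regex = re.compile(r"<td[\w\"'=\s]*>([^><]+)<\/td")

html_doc = """
<table height="100%" border="0" cellpadding="0" cellspacing="0">
<tr><td width="4" class="cll">Hello</td><td>&nbsp;</td>
<td class="tex" nowrap>Output Status</td><td width="4" class="clr">100%</td></tr></table>
"""


def extract_cells(html):
    """Return the text of each <td>, skipping cells that are blank or only &nbsp;."""
    cells = []
    for raw in extract_td_regex.findall(html):
        # The regex sees the literal entity text, so replace it by hand
        text = raw.replace("&nbsp;", " ").strip()
        if text:
            cells.append(text)
    return cells


print(", ".join(extract_cells(html_doc)))
```

Keep in mind that the character class only allows word characters, whitespace, =, and quotes inside the opening tag, so a <td> whose attribute values contain other characters (for example width="100%") would not match.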