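Ignoring tags

Time: 2018-06-04 16:33:25

Tags: html regex

I'm not familiar with regex, and I need to extract everything between <td> NEED HERE </td>. However, I can only match the <td> tag itself when it has CSS attributes. I need to skip the tags themselves (<table>, <tr>, <td>), with or without attributes.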

<td[^>]*>

Example:

<table height="100%" border="0" cellpadding="0" cellspacing="0">
<tr><td width="4" class="cll">Hello</td>
<td class="tex" nowrap>Output Status</td><td width="4" class="clr">100%</td></tr></table>

Desired output:

Hello, Output Status, 100%

In some cases, &nbsp; will appear between these tags, and I want to skip those as well.

2 answers:

Answer 0 (score: 2)

You'll want to use an HTML parser like BeautifulSoup. You mentioned that your backend is Python. If you don't already have BeautifulSoup, install it with pip:

pip install beautifulsoup4

This should give you exactly what you are looking for:

from bs4 import BeautifulSoup

html_doc = """
<p class="story">...</p>
<table height="100%" border="0" cellpadding="0" cellspacing="0">
<tr><td width="4" class="cll">Hello</td>
<td class="tex" nowrap>Output Status</td><td width="4" class="clr">100%</td></tr></table>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

td_list = soup.find_all('td')
td_list_text = []

for td in td_list:
    td_list_text.append(td.text)

my_string = ", ".join(td_list_text)
print(my_string)

Output:

Hello, Output Status, 100%

You can read more about the options available here: https://www.crummy.com/software/BeautifulSoup/
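
If installing a third-party package is not an option, the same idea can be sketched with the standard library's html.parser module instead of BeautifulSoup. This is a minimal sketch rather than a drop-in replacement: the names TdTextExtractor and td_texts are made up for illustration, the extra <td>&nbsp;</td> cell is a hypothetical addition to the question's sample, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TdTextExtractor(HTMLParser):
    """Collects the text found inside each <td> element (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs defaults to True, so &nbsp; arrives as "\xa0".
        super().__init__()
        self.in_td = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_td = True
            self.cells.append("")  # start a new cell

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_td = False

    def handle_data(self, data):
        if self.in_td:
            self.cells[-1] += data


def td_texts(html_doc):
    """Return the non-empty text of every <td>, with &nbsp; treated as a space."""
    parser = TdTextExtractor()
    parser.feed(html_doc)
    parser.close()
    cleaned = (cell.replace("\xa0", " ").strip() for cell in parser.cells)
    return [cell for cell in cleaned if cell]


# Sample from the question, plus one hypothetical cell holding only &nbsp;.
html_doc = """
<table height="100%" border="0" cellpadding="0" cellspacing="0">
<tr><td width="4" class="cll">Hello</td><td>&nbsp;</td>
<td class="tex" nowrap>Output Status</td><td width="4" class="clr">100%</td></tr></table>
"""

print(", ".join(td_texts(html_doc)))  # Hello, Output Status, 100%
```

Cells that contain only &nbsp; end up empty after cleaning and are dropped, which covers the "skip them as well" part of the question.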

Answer 1 (score: 2)

Caveat upfront:

Using regexes on HTML is inherently error-prone, and many well-intentioned people will tell you to never do it ever. I generally recommend using an HTML parser like in sniperd's answer.

But for simple data extraction (e.g. no tag nesting) regexes are sometimes just fine:

import re

extract_td_regex = re.compile(r"<td[\w\"'=\s]*>([^><]+)<\/td")

Let's break that down:

"<td"         # start td tag
"[\w\"'=\s]*" # match any word character, white space, =, ', " zero or more times
">"           # close opening td tag
"([^><]+)"    # capture group: one or more characters that are not > or <
"<\/td"       # closing td tag

The capture group will contain the inner td contents.
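
As a rough usage sketch, the pattern can be applied to the question's sample with findall, which returns the capture group for every match. The &nbsp; cleanup through html.unescape is an assumption about how the asker wants entities handled, and the extra <td>&nbsp;</td> cell is a hypothetical addition to the sample.

```python
import html
import re

extract_td_regex = re.compile(r"<td[\w\"'=\s]*>([^><]+)<\/td")

# Sample from the question, plus one hypothetical cell holding only &nbsp;.
html_doc = """
<table height="100%" border="0" cellpadding="0" cellspacing="0">
<tr><td width="4" class="cll">Hello</td><td>&nbsp;</td>
<td class="tex" nowrap>Output Status</td><td width="4" class="clr">100%</td></tr></table>
"""

# findall returns the contents of the single capture group for each match.
raw_cells = extract_td_regex.findall(html_doc)

cells = []
for raw in raw_cells:
    # Decode entities (&nbsp; becomes "\xa0"), then treat it as ordinary whitespace.
    text = html.unescape(raw).replace("\xa0", " ").strip()
    if text:  # skip cells that held nothing but &nbsp;
        cells.append(text)

print(", ".join(cells))  # Hello, Output Status, 100%
```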

Note that this will fail if a td contains other tags (like spans), because the capture group excludes < and >.
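
A small sketch of where the pattern stops matching. The three input strings are made-up examples, not taken from the question. Besides nested tags, the attribute character class also rejects any attribute value containing characters outside word characters, quotes, = and whitespace, such as the % in width="100%".

```python
import re

extract_td_regex = re.compile(r"<td[\w\"'=\s]*>([^><]+)<\/td")

# Hypothetical inputs chosen to show the pattern's limits.
plain = '<td class="cll">Hello</td>'
nested = '<td class="cll"><span>Hello</span></td>'
percent_attr = '<td width="100%">Hello</td>'

print(extract_td_regex.findall(plain))         # ['Hello']
print(extract_td_regex.findall(nested))        # [] (the span's < ends the match)
print(extract_td_regex.findall(percent_attr))  # [] (% is not in the attribute class)
```

If inputs like these can occur, an HTML parser is the safer choice.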