Question

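Python beginner here. There is probably a command for this that I don't know about, but I couldn't find a solution online. I have an HTML file stored as a string in Python. The file looks like this: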

<table>
This is Table 1
</table>

<table>
This is Table 2
</table>

<table>
This is Table 3
</table>

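I want to extract the text between <table> and </table>, but only if the table contains a certain string. So in this case I only want the table that says "Table 2".

I tried splitting the document on the table tags, but that gets messy because the result also includes the parts between </table> and <table>. I know the re.search command, but I don't know how to combine it with an if statement.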

re.search('<table>(.*)</table>', html_string)

Answer 1

One idea is to parse the HTML with BeautifulSoup, which lets you access the table tags directly.

You can then get the inner HTML of each table and compare it with your string. This assumes you are able to load the HTML into BeautifulSoup. Taken from https://www.pluralsight.com/guides/web-scraping-with-beautiful-soup
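A minimal sketch of the BeautifulSoup approach, assuming the bs4 package is installed and the HTML is held in a string. The names html_text and tables_containing are made up for illustration, and the built-in "html.parser" backend is just one possible choice.

```python
from bs4 import BeautifulSoup

# Sample input copied from the question (html_text is an illustrative name).
html_text = """<table>
This is Table 1
</table>

<table>
This is Table 2
</table>

<table>
This is Table 3
</table>"""


def tables_containing(html_text, needle):
    """Return the stripped text of every <table> whose text contains needle."""
    soup = BeautifulSoup(html_text, "html.parser")
    matches = []
    for table in soup.find_all("table"):
        table_text = table.get_text()
        # Plain substring test; this is the "if" the question asks about.
        if needle in table_text:
            matches.append(table_text.strip())
    return matches


print(tables_containing(html_text, "Table 2"))  # ['This is Table 2']
```

Because get_text() collects the text of nested tags as well, the same check should also work for real tables that contain rows and cells.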

Answer 2

Use the lxml parser to solve this.

from lxml import html

text = '''<table>This is Table 1</table>

<table>This is Table 2</table>

<table>This is Table 3</table>'''

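# Parse the string into an element tree.
parser = html.fromstring(text)
# Select tables whose first text node contains 'Table 2' and return their text.
print(parser.xpath("//table[contains(text(), 'Table 2')]/text()"))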

The output will look like this:

['This is Table 2']
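If you would rather stay with re from the question, the regex can be combined with an if statement roughly as follows. This is only a sketch: it assumes the tags are written exactly as <table> with no attributes and that tables are not nested, and the names TABLE_RE and find_table are made up for illustration. For anything less regular, a real parser is the safer option.

```python
import re

# Sample input copied from the question.
text = """<table>
This is Table 1
</table>

<table>
This is Table 2
</table>

<table>
This is Table 3
</table>"""

# Non-greedy (.*?) stops at the first </table>; re.DOTALL lets "." span newlines.
TABLE_RE = re.compile(r"<table>(.*?)</table>", re.DOTALL)


def find_table(text, needle):
    """Return the text of the first table containing needle, or None."""
    for match in TABLE_RE.finditer(text):
        body = match.group(1)
        if needle in body:
            return body.strip()
    return None


print(find_table(text, "Table 2"))  # This is Table 2
```

The greedy (.*) from the question would run from the first <table> to the last </table>, which is why the non-greedy form is used here.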
