Question

I have a <td> from which I want to extract the text. That is, I need the text Tom Cruz, Homer Simpson, Bill Clinton, one from each <td> tag, using a Python regular expression.

<td>

Any ideas?

Update 1: If an HTML parser is the standard approach, how should I go about it?

Answer 1

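I know you asked for a regex-only solution, but I suggest you consider a safer, faster and simpler approach: use an HTML parsing library such as BeautifulSoup (which can use lxml as its parser) or html5lib. These libraries can parse invalid HTML and give you access to the resulting parse tree.

Using BeautifulSoup: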

import bs4

html = """
<td class="clic-cul manga" template=".woxColumnyd" maz="/ajax/blac-woxm/xom-line/expanded/2002-2012/11-05-2022/01/fam.json">Tom Cruz</td>
<td class="clic-cul manga" template=".woxColumnx" mac="/ajax/blac-woxm/xom-line/expanded/2002-2012/11-05-2022/01/fam.json">Homer Simpson</td>
<td class="clic-cul manga" template=".woxColumnz" max="/ajax/blac-woxm/xom-line/expanded/2002-2012/11-05-2022/01/fam.json">Bill Clinton</td>
"""

# Parse with the lxml backend (requires the lxml package) and collect the text of every <td>
doc = bs4.BeautifulSoup(html, 'lxml')
print([el.text for el in doc.find_all('td')])

This prints:

['Tom Cruz', 'Homer Simpson', 'Bill Clinton']
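
If installing a third-party package is not an option, roughly the same extraction can be sketched with the standard library's html.parser module. This is only a minimal sketch and not part of the original answer: the class name TdTextExtractor and the shortened sample markup are illustrative assumptions, and nested <td> elements are not handled.

```python
from html.parser import HTMLParser


class TdTextExtractor(HTMLParser):
    """Collects the text inside each <td> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_td = False   # True while we are between <td> and </td>
        self.cells = []      # one string per <td> encountered

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_td = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_td = False

    def handle_data(self, data):
        # Only keep text that appears inside a <td>
        if self.in_td:
            self.cells[-1] += data


# Shortened sample markup (assumption: attributes trimmed for readability)
sample = (
    '<td class="clic-cul manga">Tom Cruz</td>\n'
    '<td class="clic-cul manga">Homer Simpson</td>\n'
    '<td class="clic-cul manga">Bill Clinton</td>\n'
)

parser = TdTextExtractor()
parser.feed(sample)
parser.close()
names = [cell.strip() for cell in parser.cells]
print(names)
```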

Answer 2

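If you are looking for a one-liner regular expression: >\u+(\s\u+)?</

If not, let's say your HTML is stored in a file named dat.txt. I don't know Python, but I do know Ruby. Maybe you can work out the Python equivalent.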
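
As far as I know, Python's re module does not treat \u as a character class, so the pattern above is unlikely to work there verbatim. A rough Python counterpart, assuming each cell holds plain text with no nested tags, is to capture whatever sits between the opening and closing td tags. The pattern and the shortened sample string below are illustrative assumptions, not taken from the original answer.

```python
import re

# Shortened sample markup (assumption: attributes trimmed for readability)
sample = (
    '<td class="clic-cul manga">Tom Cruz</td>\n'
    '<td class="clic-cul manga">Homer Simpson</td>\n'
    '<td class="clic-cul manga">Bill Clinton</td>\n'
)

# <td ...>  then capture text containing no '<'  then </td>
td_text = re.compile(r"<td\b[^>]*>([^<]*)</td>", re.IGNORECASE)

# findall returns the captured group for every match
names = [text.strip() for text in td_text.findall(sample)]
print(names)
```

This breaks as soon as a cell contains nested markup or the HTML is malformed, which is the usual reason for preferring a parser.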

xfile3 = File.open("dat.txt", "r")  # HTML stored in dat.txt
i = -2                              # start offset so that i lands exactly on the positions of the names in the array
ch = xfile3.read
xfile3.close
arr = ch.split(/[<>]/)              # split ch into arr whenever < or > is encountered
while i <= 100                      # replace 100 with whatever upper bound suits your data
  i = i + 4                         # names sit at indices 2, 6, 10, ...
  puts arr[i]
end
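
For readers who want the Python version of the Ruby snippet above, here is a hedged sketch of the same splitting idea. It assumes the input starts directly with a <td ...> tag and that each cell is followed by a newline, so the names land at indices 2, 6, 10, and so on. The function name names_by_splitting is hypothetical, and the sample is passed as a string rather than read from dat.txt to keep the example self-contained.

```python
import re


def names_by_splitting(html_text):
    """Split on '<' or '>' and pick every fourth piece, starting at index 2."""
    parts = re.split(r"[<>]", html_text)
    # For '<td ...>Tom Cruz</td>\n<td ...>...' the pieces look like
    # ['', 'td ...', 'Tom Cruz', '/td', '\n', 'td ...', ...]
    return [piece.strip() for piece in parts[2::4]]


# Shortened sample markup (assumption: attributes trimmed for readability)
sample = (
    '<td class="clic-cul manga">Tom Cruz</td>\n'
    '<td class="clic-cul manga">Homer Simpson</td>\n'
    '<td class="clic-cul manga">Bill Clinton</td>\n'
)

names = names_by_splitting(sample)
print(names)
```

The fixed stride of four makes this approach fragile: any extra tag, or a `<` or `>` inside an attribute value, shifts the indices.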

Proof that it works

Extract text between HTML td tags
