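I have a text file with the content shown below. After a specific string match (highlighted in bold below), I need to extract the link text 29565618 from <a href="https://support.oracle.com/******">29565618</a>. The line to match is:
<div title="Available on both MOS and OTN">Oracle **JDK** 8 Update 212 <strong>(**public**)</strong></div>
Note: in the input text file, the href tag is two lines above the line that matches this pattern.
Input text file: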
<tr>
<td class="km"><a href="https://support.oracle.com/epmos/faces/PatchResultsNDetails?patchId=29565618">29565618</a></td>
<td class="km">
<div title="Available on both MOS and OTN">Oracle **JDK** 8 Update 212 <strong>(**public**)</strong></div>
</td>
<td class="km">16-APR-2019</td>
</tr>
<tr>
<td class="km"><a href="https://support.oracle.com/epmos/faces/PatchResultsNDetails?patchId=29206839">29206839</a></td>
<td class="km">Oracle JRE 8 Update 211 Enterprise Installer</td>
<td class="km">16-APR-2019</td>
</tr>
<tr>
<td class="km"><a href="https://support.oracle.com/epmos/faces/PatchResultsNDetails?patchId=29206838">29206838</a></td>
<td class="km">
<div title="Available on both MOS and OTN">Oracle SERVER JRE 8 Update 211 <strong>(public)</strong></div>
</td>
<td class="km">16-APR-2019</td>
</tr>
<tr>
<td class="km"><a href="https://support.oracle.com/epmos/faces/PatchResultsNDetails?patchId=29206859">29206859</a></td>
<td class="km">
<div title="Available on both MOS and OTN">Oracle Java SE Embedded 8 Update 211 <strong>(public)</strong></div>
</td>
<td class="km">16-APR-2019</td>
</tr>
Expected output:
29565618
My code:
with open('file.txt') as f:
    my_list = list(f)
try:
    if my_list.index('JDK') > 0 and my_list.index('public') > 0:
        print(string[4:-4])
except:
    pass
Answer 0 (score: 3)
You can do this with Beautiful Soup:
from bs4 import BeautifulSoup
html_doc = """
<tr>
<td class="km"><a href="https://support.oracle.com/epmos/faces/PatchResultsNDetails?patchId=29565618">29565618</a></td>
<td class="km">
<div title="Available on both MOS and OTN">Oracle **JDK** 8 Update 212 <strong>(**public**)</strong></div>
</td>
<td class="km">16-APR-2019</td>
</tr>
<tr>
<td class="km"><a href="https://support.oracle.com/epmos/faces/PatchResultsNDetails?patchId=29206839">29206839</a></td>
<td class="km">Oracle JRE 8 Update 211 Enterprise Installer</td>
<td class="km">16-APR-2019</td>
</tr>
<tr>
<td class="km"><a href="https://support.oracle.com/epmos/faces/PatchResultsNDetails?patchId=29206838">29206838</a></td>
<td class="km">
<div title="Available on both MOS and OTN">Oracle SERVER JRE 8 Update 211 <strong>(public)</strong></div>
</td>
<td class="km">16-APR-2019</td>
</tr>
<tr>
<td class="km"><a href="https://support.oracle.com/epmos/faces/PatchResultsNDetails?patchId=29206859">29206859</a></td>
<td class="km">
<div title="Available on both MOS and OTN">Oracle Java SE Embedded 8 Update 211 <strong>(public)</strong></div>
</td>
<td class="km">16-APR-2019</td>
</tr>"""
soup = BeautifulSoup(html_doc, 'html.parser')
trs = soup.find_all('tr')
for tr in trs:
    # Only rows that contain a <div> carry the product description.
    if tr.div:
        div_text = tr.div.get_text()
        if "JDK" in div_text and "public" in div_text:
            # The patch ID is the only all-digit cell in the row.
            for td in tr.find_all('td'):
                td_text = td.get_text()
                if td_text.isdigit():
                    print(td_text)
Output:
29565618
Answer 1 (score: 1)
If data is the HTML snippet from the question, then this script:
import re
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'html.parser')
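# Select links in a td.km cell that is followed by a sibling td.km cell.
for a in soup.select('td.km:has(~ td.km) > a'):
    # The next td.km cell holds the product description.
    if re.findall(r' JDK.*?\(public\)', a.find_next('td', class_='km').text):
        print(a.text)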
Prints:
29565618
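One point worth checking: the pattern above appears to assume that the `**` markers in the question are only emphasis added by the asker and are not present in the real file. The short sketch below tests the same regular expression against three strings. The variable names are illustrative, and the strings are the cell texts from the question with tags stripped.

```python
import re

# Same pattern as in the answer: " JDK", then anything, then "(public)".
pattern = re.compile(r' JDK.*?\(public\)')

plain = "Oracle JDK 8 Update 212 (public)"            # markers removed
marked = "Oracle **JDK** 8 Update 212 (**public**)"   # markers kept literally
server_jre = "Oracle SERVER JRE 8 Update 211 (public)"

print(bool(pattern.search(plain)))       # True
print(bool(pattern.search(marked)))      # False
print(bool(pattern.search(server_jre)))  # False
```

If the asterisks really are part of the file, a plain substring check such as `"JDK" in text and "public" in text` would be the more forgiving choice.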
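Answer 2 (score: 1)
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
# Find the first text node containing "JDK" (string= is the current name of the older text= argument).
match = soup.find(string=lambda t: "JDK" in t)
if match and 'public' in match.parent.text:
    # The patch ID is the text of the closest preceding link.
    print(match.find_previous('a').text)
Thanks to @Andrej Kesely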
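The same "find the match, then look backwards for the nearest link" idea can also be sketched with the standard library only, working line by line as the question describes. This is a minimal sketch, not a general HTML parser: the function name `find_patch_ids` and the link regex are assumptions made for illustration, and it relies on each link sitting on a single line as in the sample input.

```python
import re

def find_patch_ids(lines, keywords=("JDK", "public")):
    """Return the link text nearest above each line that contains all keywords."""
    link_re = re.compile(r'<a\s+href="[^"]*">([^<]+)</a>')
    ids = []
    for i, line in enumerate(lines):
        if all(k in line for k in keywords):
            # Walk backwards to the closest earlier line that holds a link.
            for prev in reversed(lines[:i]):
                m = link_re.search(prev)
                if m:
                    ids.append(m.group(1))
                    break
    return ids

# First two rows of the question's input, copied verbatim.
sample = """<tr>
<td class="km"><a href="https://support.oracle.com/epmos/faces/PatchResultsNDetails?patchId=29565618">29565618</a></td>
<td class="km">
<div title="Available on both MOS and OTN">Oracle **JDK** 8 Update 212 <strong>(**public**)</strong></div>
</td>
<td class="km">16-APR-2019</td>
</tr>
<tr>
<td class="km"><a href="https://support.oracle.com/epmos/faces/PatchResultsNDetails?patchId=29206839">29206839</a></td>
<td class="km">Oracle JRE 8 Update 211 Enterprise Installer</td>
<td class="km">16-APR-2019</td>
</tr>""".splitlines()

print(find_patch_ids(sample))  # ['29565618']
```

To run it against the file from the question, something like `find_patch_ids(open('file.txt').read().splitlines())` should work, assuming the file has the same layout.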
Answer 3 (score: 0)
Answer 4 (score: 0)
from simplified_scrapy.simplified_doc import SimplifiedDoc
html = ''' <tr>
<td class="km"><a href="https://support.oracle.com/epmos/faces/PatchResultsNDetails?patchId=29565618">29565618</a></td>
<td class="km">
<div title="Available on both MOS and OTN">Oracle **JDK** 8 Update 212 <strong>(**public**)</strong></div>
</td>
<td class="km">16-APR-2019</td>
</tr>
<tr>
<td class="km"><a href="https://support.oracle.com/epmos/faces/PatchResultsNDetails?patchId=29206839">29206839</a></td>
<td class="km">Oracle JRE 8 Update 211 Enterprise Installer</td>
<td class="km">16-APR-2019</td>
</tr>
<tr>
<td class="km"><a href="https://support.oracle.com/epmos/faces/PatchResultsNDetails?patchId=29206838">29206838</a></td>
<td class="km">
<div title="Available on both MOS and OTN">Oracle SERVER JRE 8 Update 211 <strong>(public)</strong></div>
</td>
<td class="km">16-APR-2019</td>
</tr>
<tr>
<td class="km"><a href="https://support.oracle.com/epmos/faces/PatchResultsNDetails?patchId=29206859">29206859</a></td>
<td class="km">
<div title="Available on both MOS and OTN">Oracle Java SE Embedded 8 Update 211 <strong>(public)</strong></div>
</td>
<td class="km">16-APR-2019</td>
</tr>'''
doc = SimplifiedDoc(html)
trs = doc.trs.contains(['JDK','public'])
for tr in trs:
    print(tr.a.text) # 29565618