How do I extract the nth line after a pattern match in Python?

Asked: 2019-12-25 23:25:55

Tags: python regex python-3.x beautifulsoup pattern-matching

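I have a text file with the content shown below. After a specific string matches (highlighted in bold below), I need to extract <a href="https://support.oracle.com/******">29565618></a>

<div title="Available on both MOS and OTN">Oracle JDK 8 Update 212 <strong> (public) </strong></div>

Note: in the input text file, the href tag is two lines above the line that matches this pattern.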


Input text file:

<tr>
<td class="km"><a href="https://support.oracle.com/epmos/faces/PatchResultsNDetails?patchId=29565618">29565618</a></td>
<td class="km">
<div title="Available on both MOS and OTN">Oracle **JDK** 8 Update 212 <strong>(**public**)</strong></div>
</td>
<td class="km">16-APR-2019</td>
</tr>
<tr>
<td class="km"><a href="https://support.oracle.com/epmos/faces/PatchResultsNDetails?patchId=29206839">29206839</a></td>
<td class="km">Oracle JRE 8 Update 211 Enterprise Installer</td>
<td class="km">16-APR-2019</td>
</tr>
<tr>
<td class="km"><a href="https://support.oracle.com/epmos/faces/PatchResultsNDetails?patchId=29206838">29206838</a></td>
<td class="km">
<div title="Available on both MOS and OTN">Oracle SERVER JRE 8 Update 211 <strong>(public)</strong></div>
</td>
<td class="km">16-APR-2019</td>
</tr>
<tr>
<td class="km"><a href="https://support.oracle.com/epmos/faces/PatchResultsNDetails?patchId=29206859">29206859</a></td>
<td class="km">
<div title="Available on both MOS and OTN">Oracle Java SE Embedded 8 Update 211 <strong>(public)</strong></div>
</td>
<td class="km">16-APR-2019</td>
</tr>

Expected output:

29565618

My code:

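    import re

    with open('file.txt') as f:
        lines = f.read().splitlines()

    for i, line in enumerate(lines):
        # Find the line that mentions both JDK and public
        if i >= 2 and 'JDK' in line and 'public' in line:
            # The <a> tag is two lines above the matching line
            m = re.search(r'>(\d+)</a>', lines[i - 2])
            if m:
                print(m.group(1))  # 29565618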

5 answers:

Answer 0 (score: 3)

You can do this with Beautiful Soup:

from bs4 import BeautifulSoup

html_doc = """
<tr>
<td class="km"><a href="https://support.oracle.com/epmos/faces/PatchResultsNDetails?patchId=29565618">29565618</a></td>
<td class="km">
<div title="Available on both MOS and OTN">Oracle **JDK** 8 Update 212 <strong>(**public**)</strong></div>
</td>
<td class="km">16-APR-2019</td>
</tr>
<tr>
<td class="km"><a href="https://support.oracle.com/epmos/faces/PatchResultsNDetails?patchId=29206839">29206839</a></td>
<td class="km">Oracle JRE 8 Update 211 Enterprise Installer</td>
<td class="km">16-APR-2019</td>
</tr>
<tr>
<td class="km"><a href="https://support.oracle.com/epmos/faces/PatchResultsNDetails?patchId=29206838">29206838</a></td>
<td class="km">
<div title="Available on both MOS and OTN">Oracle SERVER JRE 8 Update 211 <strong>(public)</strong></div>
</td>
<td class="km">16-APR-2019</td>
</tr>
<tr>
<td class="km"><a href="https://support.oracle.com/epmos/faces/PatchResultsNDetails?patchId=29206859">29206859</a></td>
<td class="km">
<div title="Available on both MOS and OTN">Oracle Java SE Embedded 8 Update 211 <strong>(public)</strong></div>
</td>
<td class="km">16-APR-2019</td>
</tr>"""

soup = BeautifulSoup(html_doc, 'html.parser')

trs = soup.find_all('tr')

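for tr in trs:
    # Only rows whose description is wrapped in a <div> are candidates
    if tr.div:
        div_text = tr.div.get_text()
        if "JDK" in div_text and "public" in div_text:
            # The patch number is the only all-digit cell in the row
            for td in tr.find_all('td'):
                td_text = td.get_text()
                if td_text.isdigit():
                    print(td_text)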

Output:

29565618
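
If Beautiful Soup is not available, the same row-by-row idea can be sketched with the standard library's `html.parser`. This is only a rough illustration: `RowCollector`, `find_patch_ids`, and the shortened `sample` string are hypothetical names and data introduced here, not part of the answer above.

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []       # one list of cell strings per <tr>
        self._in_td = False  # True while the parser is inside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self.rows.append([])
        elif tag == 'td' and self.rows:
            self.rows[-1].append('')
            self._in_td = True

    def handle_endtag(self, tag):
        if tag == 'td':
            self._in_td = False

    def handle_data(self, data):
        # Text inside nested tags (<a>, <div>, <strong>) is added to the current cell
        if self._in_td:
            self.rows[-1][-1] += data


def find_patch_ids(html, keywords=('JDK', 'public')):
    """Return all-digit cells from rows whose text contains every keyword."""
    parser = RowCollector()
    parser.feed(html)
    parser.close()
    ids = []
    for cells in parser.rows:
        row_text = ' '.join(cells)
        if all(k in row_text for k in keywords):
            ids.extend(c.strip() for c in cells if c.strip().isdigit())
    return ids


# Shortened sample: two rows taken from the question's table
sample = """<tr>
<td class="km"><a href="https://support.oracle.com/epmos/faces/PatchResultsNDetails?patchId=29565618">29565618</a></td>
<td class="km">
<div title="Available on both MOS and OTN">Oracle JDK 8 Update 212 <strong>(public)</strong></div>
</td>
<td class="km">16-APR-2019</td>
</tr>
<tr>
<td class="km"><a href="https://support.oracle.com/epmos/faces/PatchResultsNDetails?patchId=29206839">29206839</a></td>
<td class="km">Oracle JRE 8 Update 211 Enterprise Installer</td>
<td class="km">16-APR-2019</td>
</tr>"""

result = find_patch_ids(sample)
print(result)
```

Grouping cells by row avoids depending on the exact line layout of the file, which a fixed "two lines above" offset would rely on.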

Answer 1 (score: 1)

If data is the HTML snippet from the question, then this script:

import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(data, 'html.parser')

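# Select each <a> that is a direct child of a td.km followed by sibling td.km cells
for a in soup.select('td.km:has(~ td.km) > a'):
    # Check the next td.km cell for " JDK" followed later by "(public)"
    if re.findall(r' JDK.*?\(public\)', a.find_next('td', class_='km').text):
        print(a.text)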

Prints:

29565618
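
The regular expression in this script expects a space directly before `JDK` and a literal `(public)`, so it appears to assume that the `**` highlight markers shown in the question are not present in the real file. A small check of that behaviour, using cell texts copied from the question (the `cells` list and `matches` name are introduced here for illustration):

```python
import re

pattern = re.compile(r' JDK.*?\(public\)')

# Cell texts as Beautiful Soup's .text would return them, minus surrounding newlines
cells = [
    "Oracle JDK 8 Update 212 (public)",
    "Oracle JRE 8 Update 211 Enterprise Installer",
    "Oracle SERVER JRE 8 Update 211 (public)",
    "Oracle **JDK** 8 Update 212 (**public**)",  # with the highlight markers left in
]

matches = [bool(pattern.search(cell)) for cell in cells]
for cell, matched in zip(cells, matches):
    print(matched, cell)
```

Only the first cell matches. The last one does not, because `*` rather than a space precedes `JDK` and the parentheses contain `**public**`.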

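Answer 2 (score: 1)

from bs4 import BeautifulSoup

# html_doc is the HTML snippet from the question
soup = BeautifulSoup(html_doc, 'html.parser')

# Find the first text node containing "JDK" (newer bs4 versions prefer string= over text=)
match = soup.find(text=lambda t: "JDK" in t)
if match and 'public' in match.parent.text:
    # The patch number is the text of the nearest preceding <a> tag
    print(match.find_previous('a').text)

Thanks to @Andrej Kesely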

Answer 3 (score: 0)

You can use:

(?=<a.*?>(.*)</a>)

Check it here; it confirms the match against your data: https://regex101.com/r/W2wV2I/1/
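
On its own, this pattern captures the text of every link, not only the one in the JDK row. A minimal sketch of using it from Python and then filtering by row follows; `jdk_public_ids` and the shortened `html` sample are hypothetical additions, and the row filter is an assumption about how the pattern might be combined with the question's condition.

```python
import re

# Shortened sample: two rows taken from the question's table
html = """<tr>
<td class="km"><a href="https://support.oracle.com/epmos/faces/PatchResultsNDetails?patchId=29565618">29565618</a></td>
<td class="km">
<div title="Available on both MOS and OTN">Oracle JDK 8 Update 212 <strong>(public)</strong></div>
</td>
</tr>
<tr>
<td class="km"><a href="https://support.oracle.com/epmos/faces/PatchResultsNDetails?patchId=29206839">29206839</a></td>
<td class="km">Oracle JRE 8 Update 211 Enterprise Installer</td>
</tr>"""

pattern = re.compile(r'(?=<a.*?>(.*)</a>)')

# The lookahead matches at each "<a" and captures the link text
all_ids = pattern.findall(html)


def jdk_public_ids(text):
    """Keep only link texts from <tr> blocks that mention both JDK and public."""
    ids = []
    for row in re.findall(r'<tr>.*?</tr>', text, flags=re.DOTALL):
        if 'JDK' in row and 'public' in row:
            ids.extend(pattern.findall(row))
    return ids


print(all_ids)
print(jdk_public_ids(html))
```

The first list contains both patch numbers, while the filtered list contains only the one from the JDK row.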

Answer 4 (score: 0)

from simplified_scrapy.simplified_doc import SimplifiedDoc 
html = ''' <tr>
<td class="km"><a href="https://support.oracle.com/epmos/faces/PatchResultsNDetails?patchId=29565618">29565618</a></td>
<td class="km">
<div title="Available on both MOS and OTN">Oracle **JDK** 8 Update 212 <strong>(**public**)</strong></div>
</td>
<td class="km">16-APR-2019</td>
</tr>
<tr>
<td class="km"><a href="https://support.oracle.com/epmos/faces/PatchResultsNDetails?patchId=29206839">29206839</a></td>
<td class="km">Oracle JRE 8 Update 211 Enterprise Installer</td>
<td class="km">16-APR-2019</td>
</tr>
<tr>
<td class="km"><a href="https://support.oracle.com/epmos/faces/PatchResultsNDetails?patchId=29206838">29206838</a></td>
<td class="km">
<div title="Available on both MOS and OTN">Oracle SERVER JRE 8 Update 211 <strong>(public)</strong></div>
</td>
<td class="km">16-APR-2019</td>
</tr>
<tr>
<td class="km"><a href="https://support.oracle.com/epmos/faces/PatchResultsNDetails?patchId=29206859">29206859</a></td>
<td class="km">
<div title="Available on both MOS and OTN">Oracle Java SE Embedded 8 Update 211 <strong>(public)</strong></div>
</td>
<td class="km">16-APR-2019</td>
</tr>'''
doc = SimplifiedDoc(html)
trs = doc.trs.contains(['JDK','public'])
for tr in trs:
  print(tr.a.text) # 29565618