Question

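I wrote a script that scrapes a website and puts the content into a Word file. My problem is that, as in the HTML below, there are two groups of paragraphs, and I want to get the text from each group separately. So my question is: is there a way to search only for the paragraphs between two specific h2 headings, or some other way to solve this?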

HTML:

<h2 class="pt-3" id="mitigation">Mitigation</h2>
<p>Access tokens are an integral part of the security system within Windows and cannot be turned off. However, an attacker must already have administrator level access on the local system to make full use of this technique; be sure to restrict users and accounts to the least privileges they require to do their job.</p><p>Any user can also spoof access tokens if they have legitimate credentials. Follow mitigation guidelines for preventing adversary use of <a href="/techniques/T1078">Valid Accounts</a>. Limit permissions so that users and user groups cannot create tokens. This setting should be defined for the local system account only. GPO: Computer Configuration &gt; [Policies] &gt; Windows Settings &gt; Security Settings &gt; Local Policies &gt; User Rights Assignment: Create a token object. <span  id="scite-ref-19-a" class="scite-citeref-number" data-reference="Microsoft Create Token"><sup><a href="https://docs.microsoft.com/windows/device-security/security-policy-settings/create-a-token-object" target="_blank" data-hasqtip="18" aria-describedby="qtip-18">[19]</a></sup></span> Also define who can create a process level token to only the local and network service through GPO: Computer Configuration &gt; [Policies] &gt; Windows Settings &gt; Security Settings &gt; Local Policies &gt; User Rights Assignment: Replace a process level token. <span  id="scite-ref-20-a" class="scite-citeref-number" data-reference="Microsoft Replace Process Token"><sup><a href="https://docs.microsoft.com/windows/device-security/security-policy-settings/replace-a-process-level-token" target="_blank" data-hasqtip="19" aria-describedby="qtip-19">[20]</a></sup></span></p><p>Also limit opportunities for adversaries to increase privileges by limiting Privilege Escalation opportunities.</p>
<h2 class="pt-3" id="detection">Detection</h2>
<p>If an adversary is using a standard command-line shell, analysts can detect token manipulation by auditing command-line activity. Specifically, analysts should look for use of the <code>runas</code> command. Detailed command-line logging is not enabled by default in Windows. <span  id="scite-ref-21-a" class="scite-citeref-number" data-reference="Microsoft Command-line Logging"><sup><a href="https://technet.microsoft.com/en-us/windows-server-docs/identity/ad-ds/manage/component-updates/command-line-process-auditing" target="_blank" data-hasqtip="20" aria-describedby="qtip-20">[21]</a></sup></span></p><p>If an adversary is using a payload that calls the Windows token APIs directly, analysts can detect token manipulation only through careful analysis of user network activity, examination of running processes, and correlation with other endpoint and network behavior. </p><p>There are many Windows API calls a payload can take advantage of to manipulate access tokens (e.g., <code>LogonUser</code> <span  id="scite-ref-22-a" class="scite-citeref-number" data-reference="Microsoft LogonUser"><sup><a href="https://msdn.microsoft.com/en-us/library/windows/desktop/aa378184(v=vs.85).aspx" target="_blank" data-hasqtip="21" aria-describedby="qtip-21">[22]</a></sup></span>, <code>DuplicateTokenEx</code> <span  id="scite-ref-23-a" class="scite-citeref-number" data-reference="Microsoft DuplicateTokenEx"><sup><a href="https://msdn.microsoft.com/en-us/library/windows/desktop/aa446617(v=vs.85).aspx" target="_blank" data-hasqtip="22" aria-describedby="qtip-22">[23]</a></sup></span>, and <code>ImpersonateLoggedOnUser</code> <span  id="scite-ref-24-a" class="scite-citeref-number" data-reference="Microsoft ImpersonateLoggedOnUser"><sup><a href="https://msdn.microsoft.com/en-us/library/windows/desktop/aa378612(v=vs.85).aspx" target="_blank" data-hasqtip="23" aria-describedby="qtip-23">[24]</a></sup></span>). 
Please see the referenced Windows API pages for more information.</p><p>Query systems for process and thread token information and look for inconsistencies such as user owns processes impersonating the local SYSTEM account. <span  id="scite-ref-3-a" class="scite-citeref-number" data-reference="BlackHat Atkinson Winchester Token Manipulation"><sup><a href="https://www.blackhat.com/docs/eu-17/materials/eu-17-Atkinson-A-Process-Is-No-One-Hunting-For-Token-Manipulation.pdf" target="_blank" data-hasqtip="2" aria-describedby="qtip-2">[3]</a></sup></span></p>

Code:

import time

import requests
from bs4 import BeautifulSoup
from docx import Document


def linkgenerator_getlink():
    link = "https://attack.mitre.org/techniques/"
    for i in range(1001, 1224):
        fullurl = link + "T" + str(i) + "/"
        source = requests.get(fullurl).text
        time.sleep(15)  # pause between requests
        soup = BeautifulSoup(source, 'lxml')

        document = Document()
        document.add_heading(soup.find('h1').text.strip(), 0)

        # Grabs every <p> on the page, regardless of which section it belongs to
        for x in soup.find_all("p"):
            document.add_paragraph(x.text)
        document.save('C:\\Users\\XXX\\Desktop\\script\\T%s.docx' % i)
        print("========== document no. %s is ready ==========" % i)


linkgenerator_getlink()

Answer 1

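As long as you know exactly which tags you are after, you can target them directly. Otherwise, you may need to create variables and iterate. You should have a feel for which applies, since you know what the HTML looks like.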

import bs4

r = '''<h2 class="pt-3" id="mitigation">Mitigation</h2>
        <p>Access tokens are an integral part of the security system within Windows and cannot be turned off. However, an attacker must already have administrator level access on the local system to make full use of this technique; be sure to restrict users and accounts to the least privileges they require to do their job.</p><p>Any user can also spoof access tokens if they have legitimate credentials. Follow mitigation guidelines for preventing adversary use of <a href="/techniques/T1078">Valid Accounts</a>. Limit permissions so that users and user groups cannot create tokens. This setting should be defined for the local system account only. GPO: Computer Configuration &gt; [Policies] &gt; Windows Settings &gt; Security Settings &gt; Local Policies &gt; User Rights Assignment: Create a token object. <span  id="scite-ref-19-a" class="scite-citeref-number" data-reference="Microsoft Create Token"><sup><a href="https://docs.microsoft.com/windows/device-security/security-policy-settings/create-a-token-object" target="_blank" data-hasqtip="18" aria-describedby="qtip-18">[19]</a></sup></span> Also define who can create a process level token to only the local and network service through GPO: Computer Configuration &gt; [Policies] &gt; Windows Settings &gt; Security Settings &gt; Local Policies &gt; User Rights Assignment: Replace a process level token. <span  id="scite-ref-20-a" class="scite-citeref-number" data-reference="Microsoft Replace Process Token"><sup><a href="https://docs.microsoft.com/windows/device-security/security-policy-settings/replace-a-process-level-token" target="_blank" data-hasqtip="19" aria-describedby="qtip-19">[20]</a></sup></span></p><p>Also limit opportunities for adversaries to increase privileges by limiting Privilege Escalation opportunities.</p>
        <h2 class="pt-3" id="detection">Detection</h2>
        <p>If an adversary is using a standard command-line shell, analysts can detect token manipulation by auditing command-line activity. Specifically, analysts should look for use of the <code>runas</code> command. Detailed command-line logging is not enabled by default in Windows. <span  id="scite-ref-21-a" class="scite-citeref-number" data-reference="Microsoft Command-line Logging"><sup><a href="https://technet.microsoft.com/en-us/windows-server-docs/identity/ad-ds/manage/component-updates/command-line-process-auditing" target="_blank" data-hasqtip="20" aria-describedby="qtip-20">[21]</a></sup></span></p><p>If an adversary is using a payload that calls the Windows token APIs directly, analysts can detect token manipulation only through careful analysis of user network activity, examination of running processes, and correlation with other endpoint and network behavior. </p><p>There are many Windows API calls a payload can take advantage of to manipulate access tokens (e.g., <code>LogonUser</code> <span  id="scite-ref-22-a" class="scite-citeref-number" data-reference="Microsoft LogonUser"><sup><a href="https://msdn.microsoft.com/en-us/library/windows/desktop/aa378184(v=vs.85).aspx" target="_blank" data-hasqtip="21" aria-describedby="qtip-21">[22]</a></sup></span>, <code>DuplicateTokenEx</code> <span  id="scite-ref-23-a" class="scite-citeref-number" data-reference="Microsoft DuplicateTokenEx"><sup><a href="https://msdn.microsoft.com/en-us/library/windows/desktop/aa446617(v=vs.85).aspx" target="_blank" data-hasqtip="22" aria-describedby="qtip-22">[23]</a></sup></span>, and <code>ImpersonateLoggedOnUser</code> <span  id="scite-ref-24-a" class="scite-citeref-number" data-reference="Microsoft ImpersonateLoggedOnUser"><sup><a href="https://msdn.microsoft.com/en-us/library/windows/desktop/aa378612(v=vs.85).aspx" target="_blank" data-hasqtip="23" aria-describedby="qtip-23">[24]</a></sup></span>). 
Please see the referenced Windows API pages for more information.</p><p>Query systems for process and thread token information and look for inconsistencies such as user owns processes impersonating the local SYSTEM account. <span  id="scite-ref-3-a" class="scite-citeref-number" data-reference="BlackHat Atkinson Winchester Token Manipulation"><sup><a href="https://www.blackhat.com/docs/eu-17/materials/eu-17-Atkinson-A-Process-Is-No-One-Hunting-For-Token-Manipulation.pdf" target="_blank" data-hasqtip="2" aria-describedby="qtip-2">[3]</a></sup></span></p>'''


html = bs4.BeautifulSoup(r, 'html.parser')

# assuming the first section you want sits under the <h2> with id="mitigation":
# find that heading, then grab the next <p> sibling (only the first one)
para_1 = html.find('h2', {'id':'mitigation'})
p1 = para_1.find_next_sibling('p').text

para_2 = html.find('h2', {'id':'detection'})
p2 = para_2.find_next_sibling('p').text
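
The snippet above returns only the first paragraph after each heading. To collect every paragraph up to the next heading, one possible approach is to walk the following siblings and stop at the next h2. This is a minimal sketch rather than part of the original answer: the helper name paragraphs_between and the shortened sample HTML are made up for illustration, and it assumes the paragraphs are direct siblings of the heading, as in the question's HTML.

```python
from bs4 import BeautifulSoup


def paragraphs_between(soup, start_id, stop_tag="h2"):
    """Return the text of each <p> sibling that follows the <h2> with the
    given id, stopping at the next stop_tag (hypothetical helper)."""
    heading = soup.find("h2", id=start_id)
    if heading is None:
        return []
    texts = []
    # find_next_siblings() with no arguments yields only tags, in document order
    for sibling in heading.find_next_siblings():
        if sibling.name == stop_tag:
            break  # reached the next section heading
        if sibling.name == "p":
            texts.append(sibling.get_text().strip())
    return texts


# Shortened stand-in for the HTML in the question
sample = """<h2 class="pt-3" id="mitigation">Mitigation</h2>
<p>First mitigation paragraph.</p><p>Second mitigation paragraph.</p>
<h2 class="pt-3" id="detection">Detection</h2>
<p>First detection paragraph.</p>"""

soup = BeautifulSoup(sample, "html.parser")
mitigation = paragraphs_between(soup, "mitigation")
detection = paragraphs_between(soup, "detection")
print(mitigation)
print(detection)
```

If the real page wraps the paragraphs in an extra container, the sibling walk would need to start from that container instead.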

Answer 2

This prints the text of the <p> tag that directly follows each <h2> tag with the specified class:

import bs4 as bs

content = """<h2 class="pt-3" id="mitigation">Mitigation</h2>
<p>Access tokens are an integral part of the security system within Windows and cannot be turned off. However, an attacker must already have administrator level access on the local system to make full use of this technique; be sure to restrict users and accounts to the least privileges they require to do their job.</p><p>Any user can also spoof access tokens if they have legitimate credentials. Follow mitigation guidelines for preventing adversary use of <a href="/techniques/T1078">Valid Accounts</a>. Limit permissions so that users and user groups cannot create tokens. This setting should be defined for the local system account only. GPO: Computer Configuration &gt; [Policies] &gt; Windows Settings &gt; Security Settings &gt; Local Policies &gt; User Rights Assignment: Create a token object. <span  id="scite-ref-19-a" class="scite-citeref-number" data-reference="Microsoft Create Token"><sup><a href="https://docs.microsoft.com/windows/device-security/security-policy-settings/create-a-token-object" target="_blank" data-hasqtip="18" aria-describedby="qtip-18">[19]</a></sup></span> Also define who can create a process level token to only the local and network service through GPO: Computer Configuration &gt; [Policies] &gt; Windows Settings &gt; Security Settings &gt; Local Policies &gt; User Rights Assignment: Replace a process level token. <span  id="scite-ref-20-a" class="scite-citeref-number" data-reference="Microsoft Replace Process Token"><sup><a href="https://docs.microsoft.com/windows/device-security/security-policy-settings/replace-a-process-level-token" target="_blank" data-hasqtip="19" aria-describedby="qtip-19">[20]</a></sup></span></p><p>Also limit opportunities for adversaries to increase privileges by limiting Privilege Escalation opportunities.</p>
<h2 class="pt-3" id="detection">Detection</h2>
<p>If an adversary is using a standard command-line shell, analysts can detect token manipulation by auditing command-line activity. Specifically, analysts should look for use of the <code>runas</code> command. Detailed command-line logging is not enabled by default in Windows. <span  id="scite-ref-21-a" class="scite-citeref-number" data-reference="Microsoft Command-line Logging"><sup><a href="https://technet.microsoft.com/en-us/windows-server-docs/identity/ad-ds/manage/component-updates/command-line-process-auditing" target="_blank" data-hasqtip="20" aria-describedby="qtip-20">[21]</a></sup></span></p><p>If an adversary is using a payload that calls the Windows token APIs directly, analysts can detect token manipulation only through careful analysis of user network activity, examination of running processes, and correlation with other endpoint and network behavior. </p><p>There are many Windows API calls a payload can take advantage of to manipulate access tokens (e.g., <code>LogonUser</code> <span  id="scite-ref-22-a" class="scite-citeref-number" data-reference="Microsoft LogonUser"><sup><a href="https://msdn.microsoft.com/en-us/library/windows/desktop/aa378184(v=vs.85).aspx" target="_blank" data-hasqtip="21" aria-describedby="qtip-21">[22]</a></sup></span>, <code>DuplicateTokenEx</code> <span  id="scite-ref-23-a" class="scite-citeref-number" data-reference="Microsoft DuplicateTokenEx"><sup><a href="https://msdn.microsoft.com/en-us/library/windows/desktop/aa446617(v=vs.85).aspx" target="_blank" data-hasqtip="22" aria-describedby="qtip-22">[23]</a></sup></span>, and <code>ImpersonateLoggedOnUser</code> <span  id="scite-ref-24-a" class="scite-citeref-number" data-reference="Microsoft ImpersonateLoggedOnUser"><sup><a href="https://msdn.microsoft.com/en-us/library/windows/desktop/aa378612(v=vs.85).aspx" target="_blank" data-hasqtip="23" aria-describedby="qtip-23">[24]</a></sup></span>). 
Please see the referenced Windows API pages for more information.</p><p>Query systems for process and thread token information and look for inconsistencies such as user owns processes impersonating the local SYSTEM account. <span  id="scite-ref-3-a" class="scite-citeref-number" data-reference="BlackHat Atkinson Winchester Token Manipulation"><sup><a href="https://www.blackhat.com/docs/eu-17/materials/eu-17-Atkinson-A-Process-Is-No-One-Hunting-For-Token-Manipulation.pdf" target="_blank" data-hasqtip="2" aria-describedby="qtip-2">[3]</a></sup></span></p>"""

soup = bs.BeautifulSoup(content, features="html.parser")

for h2_tag in soup('h2', {'class': 'pt-3'}):
    print(h2_tag.next_sibling.next_sibling.text)
    print("")  # blank line to separate each paragraph

Output:

Access tokens are an integral part of the security system within Windows and cannot be turned off. However, an attacker must already have administrator level access on the local system to make full use of this technique; be sure to restrict users and accounts to the least privileges they require to do their job.

If an adversary is using a standard command-line shell, analysts can detect token manipulation by auditing command-line activity. Specifically, analysts should look for use of the runas command. Detailed command-line logging is not enabled by default in Windows. [21]
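
As the output shows, only the first paragraph under each heading is printed. If every paragraph of every section is wanted, a single pass over the h2 and p tags in document order can group them by heading. The following is a sketch under assumptions, not the answerer's code: the function name group_paragraphs_by_heading and the shortened sample HTML are hypothetical, and because find_all searches recursively, paragraphs nested inside other elements would be grouped too.

```python
from bs4 import BeautifulSoup


def group_paragraphs_by_heading(soup, heading_class="pt-3"):
    """Map each matching <h2> title to the list of <p> texts that follow it
    (hypothetical helper; assumes headings and paragraphs alternate as in
    the question's HTML)."""
    sections = {}
    current = None
    # find_all with a list of names returns both tag types in document order
    for tag in soup.find_all(["h2", "p"]):
        if tag.name == "h2":
            if heading_class in tag.get("class", []):
                current = tag.get_text().strip()
                sections[current] = []
            else:
                current = None  # ignore paragraphs under other headings
        elif current is not None:
            sections[current].append(tag.get_text().strip())
    return sections


# Shortened stand-in for the HTML in the question
sample = """<p>Intro paragraph before any heading.</p>
<h2 class="pt-3" id="mitigation">Mitigation</h2>
<p>First mitigation paragraph.</p><p>Second mitigation paragraph.</p>
<h2 class="pt-3" id="detection">Detection</h2>
<p>First detection paragraph.</p>"""

sections = group_paragraphs_by_heading(BeautifulSoup(sample, "html.parser"))
for title, texts in sections.items():
    print(title, len(texts))
```

For the Word export in the question, each title could then be passed to document.add_heading(title, level=2) and each text to document.add_paragraph(text).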

How to scrape HTML with BeautifulSoup?
