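For example, I have a large amount of text containing HTML markup, and I want a function that removes HTML tags from it. However, I only want to remove the tags, not the text inside them. The problem is more complicated than it looks: if the text contains both ol and ul tags and I want to remove the ol first, the text must be kept and the li tags must be removed too, but only the li tags that belong to the ol, not those inside the ul.
I tried using BeautifulSoup and some NLP techniques, without success: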
from bs4 import BeautifulSoup
import nltk
from nltk.tokenize import word_tokenize
html_know='''<img src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSGTVf63Vm3XgOncMVSOy0-jSxdMT8KVJIc8WiWaevuWiPGe0Pm" class="image_master" alt="" style="width: 248px; height: 164px; vertical-align: middle;">
<img src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSGTVf63Vm3XgOncMVSOy0-jSxdMT8KVJIc8WiWaevuWiPGe0Pm" style="width: 250px; height: 166px;">
<img src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSGTVf63Vm3XgOncMVSOy0-jSxdMT8KVJIc8WiWaevuWiPGe0Pm" style="width: 249px; height: 165px;">
<img src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSGTVf63Vm3XgOncMVSOy0-jSxdMT8KVJIc8WiWaevuWiPGe0Pm" style="width: 249px; height: 165px;">
<p></p>
<p><strong><span style="font-family: Impact, Charcoal, sans-serif; font-size: 36px;">HTML</span></strong></p> <span style="background-color: rgb(255, 255, 0);"><p></p>HTML stands for Hyper Text Markup Language, which is the most widely used language on Web to develop web pages. HTML was created by Berners-Lee in late 1991 but "HTML 2.0" was the first standard HTML specification which was published in 1995. HTML 4.01 was a major version of HTML and it was published in late 1999.</span>Though
HTML 4.01 version is widely used but currently we are having HTML-5 version which is an extension to HTML 4.01, and this version was published in 2012. Audience This tutorial is designed for the aspiring Web Designers and Developers with a need to understand
the HTML in enough detail along with its simple overview, and practical examples. This tutorial will give you enough ingredients to start with HTML from where you can take yourself at higher level of expertise.
<p></p>
<p>
</p>
<p></p>HTML stands for Hypertext Markup Language, and it is the most widely used language to write Web Pages. Hypertext refers to the way in which Web pages (HTML documents) are linked together. Thus, the link available on a webpage is called Hypertext.
As its name suggests, HTML is a Markup Language which means you use HTML to simply "mark-up" a text document with tags that tell a Web browser how to structure it to display. Originally, HTML was developed with the intent of defining the structure of
documents like headings, paragraphs, lists, and so forth to facilitate the sharing of scientific information between researchers. Now, HTML is being widely used to format web pages with the help of different tags available in HTML language. Basic HTML
Document In its simplest form, following is an example of an HTML document
<p></p>
<p><img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAOEAAADhCAMAAAAJbSJIAAAA8FBMVEX////kTSbxZSnr6+sAAADkSR7pdFzrWSjr7/DHx8fxYyX39/fkQxPwXBT4v63xXhnoq6D3rZXkPwTyc0Dq0Mzr8/TwVwDa2tpTU1PnmInlXDv97el7e3uoqKjlb1biOAAdHR1oaGjjQg+4uLiEhIQ2NjZwcHDtXijytKnnUyftlIPnaU32p4753tn1x7/98vDzhFzqfWfwpZf31M71l3josqnpxL7lVC/q3tvriHXze030vrTmYEH42dMqKir1w7rq19PqUxv4t6P5ybr0i2f2oITnkoLmiXiamppDQ0Ourq7zf1T1k3PqSwDnoZXsaUDC4HfnAAALJklEQVR4nO2dfVvaShOHgxCggC+NgkX6SKU9p1SLVKxSqlarp9qe4+n5/t/mIcSFJMwMO2R2iVz8/tJcNeFuwuZmZzY4Di8vcioH423jTbns8Le3OTzvov/8W3jPryfbd/3fX41/fcF8jckiSpj7OdnxbmjzEhGO/iDItyUlfKX28TO8dZkIc2+DXZQjG5eKMBfs4mCJCd9F97qEhKO/ebPUhL9iw8zzJnwNEObeluNbUkIIZXR7Kz9lcmZej7dBhG9exbeknVDlfyHCcMg9rAjNZ0UYyoow9+aP8N9+m7j30hDmwp+YcrvLSOi8m/x8EBKbJSIM/VxeUsLx7t466SP8I7sbJJuA0Pkz+Mn/JIwSvjoI56djNiFC4CXzCZ9+K1OE0fz53AhHO/zbWWJC/+P9r9EPS0tYVn+2tITO309jx/ISqqwIV4RCmczCvAFeMkIYfVEY4WT7iPAXQnjgmE32xTjjbZNNZfif7kZ2AewB2s3uCzjRna2yyiqrrLLKKqusssoqzySNo3yac9RITNhtumlOs5v8JDYzaU4zOaDjLhqCjCtAeJRmRPdIgPAk1YQnAoSHqSY8FCDsVRaNQaTSEyBstxaNQaTVFiA89haNQcQ7FiC8SjXhlQBhI823/GZyafO1bdEYRCSkzXFkr9IPhYT5EN6bJwHoiAJmCmsJ8z6yOxHCS9FbfmLCQmhn7qUIYT7FhHkRwod0EYbfh+6DCKGstokSikibtLYlJgzvTETaHOc6xYTXIoSnojdEUULvVISwk2LCjgihrLYlJYzc8GWkTVjbRAllpE14tk2UUGKmzc+eJGJSwojS7AkRimqbKKGMtDnOIE2EEWkbCBHWJbVNkrBSFyJsp5ZQRtoc50JS25IShvfVuhAiJLTNrXDzfZ2bEkooJG2UtrkPW9y0N7j5q4QSykgbpW2VrWqRmfLs48XyYzMEaETaiCJppV7MMsMnjJzDKKEUIE7oDqrmCbctEPZRwrwFwvXw2zCsNJm+GOE+JjXupQXCGkbo7osR4tpWMU/YjRAakTZK25rmR5oOSigmbdR8YvOLccLTdZRQZi7RD65t3o1xwrsIYfjgYtJGlYG9c+5lyia8RQlFCsBBcG1rnRkn/LiJEkpJG1UGrmwZJ/yMEooUgIOUBbWNTfgJVxr+vRUNOtLwtY39qnBpa8kBSmobm7AUJjQkbUT3Hl/b2IS4tEl07akQ3XumCbs4oUTXngrevddiArIJcWkT6dpTwbXNg7SN+ozPPfSVDWnjalvxrI6n95KXR3SWRlDauNpWvPeIubZNXoiZNjlpo7r3IG0r/iamkJPNl0YIJbr2VHjaVjwjppDlCAWljZxPBLSteG6K0NBcoh9c2x6mb4jFHaJqLEcoKW1E9x6obV9MEUakTaZrTwXt3oO1jehtECMU6tpTIbQN+PhUJEqqcoSS0kZ173nAKazuGSI00LWnwpttqxIricQIRaWN6t6DtK1KVP4TEYZ3JNS1p3KMEwLaVr23QCgqbVQZmKttcoRSBeAgDZQQ1jYbhJLSJqltSQgNShtVJGVqmxyhLCAhpkxtS0IYkTZZLaW07
QjSNvOEwtJGde+5AGGxgq0uzxRKrOCEUl17Kvg9HCqSVi/3sPy3zUoE0aC0kWVgaLatioZ3WKIALCttVPces0jKmy/Fy6NiXXsq+KILZpGUR7iBF4BlllpMQmjbtUHCW7w8KittVBmYWSTlEf6wUQAOwtM2McJHvDwqLG1MbRMj/Gqhp00Fv+OfGCRcQwmlllpMgi66YBZJeYSRt6GRpRaToN17zCIpi7Bso2tPheje4wDyCBs2uvZUcG0Di6QyhFa69lS2hLSNRXhqT9rIMvAO54bIIiSkTbIAHERK21iEFqVNTttYhBaljSwDs7SNRfiIL7WQnUv0UyYWznBuiCzCf1Cl8QS79lTQsXSobfpLZniEeNeevLRR3XuXNzv66TSQQMdEpU20a08Fr5i5HiP/1pBAx7TTtacitFYWmS8tbQOHtCptjnMos5IUI/wEHJKQNsmuPRWhh7gghJufgUMS0iY9l+hH6CEuGOFH4JBWllpMIvTsPYRw/RY4JCFtsgXgIELP3sMI74BDEkstJLv2VISevYcRQiZtVdroh7iAVSYWYQ0yaWKphfhcoh+csJ/fnwomCBgh9JJtrI8NB3eaPFBjwsrAGCF0RFRphLv2VHBt6zPKwDBhqcQiNCJtVPeeB5WBWVcpKG3E+ljZrj0VfNFFEziFVeSzCELIkzbRpRaT8BZdYN17MKGGtBktAAchuveA2bYqclEjhExpk59L9INrWwvq3kNaUhFCUNpsrI8NR6Z7DyZMg7SRs22/AUJkkhwhhKTNyvrYcIgy8D1AiPQ2wISpkDbi2XvQWtniOfzPEULopODSJvWsvXgwwIy7D9zyb1iE0PHw9bFmpI3q3tuD+hPhi5pBiD62RbxrT4XZvccgLK0Bh7O0PjYcfNEFtFa2yhhLS/8Ah7MubdwycDXfbAH/JdOEpc1a7QdwuCt7XXsqPG3LVrPn9/0pykKMbr22/RGuBVqXtjm694rDD8JnA9erhChDhKX12trnO/TWZrFrT4WnbWPK4s7vvDemLIwvzfW/Nkg1IaRNvgAcZN4ycLE4vGDrl8EFWwguza+3M9XSagE4SJLuPZ/y+qHvVb7jb7xY7EsbU9vAt+XN2eGdrjXjXXumpI2ebdMgHFHqH20NlTYTBeAgEt17+lVuXGnku/ZUiEcma7djaBMS0ia91GISvAwMrZVNSGi5AByEuegiGSEhbfJdeyoSiy60CS0utZiE6N7TXnShTbgAaZtT2+YlfGlf2mhtq2oiahMuQNrIMnD//jyrBalHWL77EeazJW3k98q6reb+1k51JqUGYef2a219kyA0Bzjjm3Pdipd5OPtSLVKUMwgbd4+btdijk+LSZk5Lie690Kn0jn7vFPFTSRGeftwenrwpvDihia49FWLRReRUeifoqcQIGxufELo4oYGlFpNof3Pu8F25VwfHHoiwOxxXausYnR9L0uY494zetuGpbOW3buJjzxRh5+VX4uQBhAaljd+951aamYfrbPiCjRA2Nh5rwLgynfBOjc0l+pnnmy5iY8+EMBhXZtPFCY107anM2b03vGDdwdPYExA2bj/pnTyA0FABOEiCLyhTY89oXNmkx5UZhOakLWn3nj/2nPQ0xpWpmO/aU8EfmaxN+Z1LN01oEJB4xIl25nryh8Fn7cUzW9uMEEaUxqS0Ud179gjNdO2paGubLKGFrj0VvHvPJGHkbWhkqcUkyRddcAnfR7+I27C0SSy6YBG+/zC9A6PSJrHoQptw6uQ9xai0SSy60CNE6EaEZrr2VJIvuphN+L5A7sCotEl8cy5NiF2aYUKDc4l+TI400LgyHbPSRnTvJSTUOHkqhgnR7r0khPp0Brv2VPLNSjLGOOGMcSXOV2maKwA/5bS3B/arzUXIOXmjedj9tskP+ON0rwfu3KeyMOfJazX7dXNVNSCdraOmN4+kFpjjSuZpEv3C7G0QzvHhdOuhBqHeTWGM5zWPemY1hkyjfdLyEo49BJ0/oXxs+
A6vkaveZaKxB6Pzmnk744pOutcP8489EJ4/uWp1XNFJp52fb+yJ01W8ymLGFZ2c1ucYeyJ4/riSmksTTuPipDLf2DO8NDOHxwaerGMgV73hzZIF6d/y8u20XppguseHru4Fm85xRSeNdqiDHT15TXdwsfhb3vw5rROi/hzGFZ10LwbAqXxO44pOOpGxZ6TSz2tc0Un5SdSHJ++y9yzHFZ00Lgb9Q8sq/X89z7xPOazagAAAAABJRU5ErkJggg==" style="width: 119px; height: 119px;"></p>
<table style="width:100%">
<tbody>
<tr>
<th style="border-color: rgb(0, 0, 0);">Firstname</th>
<th style="border-color: rgb(0, 0, 0);">Lastname</th>
<th style="border-color: rgb(0, 0, 0);">Age</th>
</tr>
<tr>
<td style="border-color: rgb(0, 0, 0);">Jill</td>
<td style="border-color: rgb(0, 0, 0);">Smith</td>
<td style="border-color: rgb(0, 0, 0);">50</td>
</tr>
<tr>
<td style="border-color: rgb(0, 0, 0);">Eve</td>
<td style="border-color: rgb(0, 0, 0);">Jackson</td>
<td style="border-color: rgb(0, 0, 0);">94</td>
</tr>
</tbody>
</table>
<p></p>
<p></p>
<p></p>
<p></p>
<p></p>
<p></p>HTML Tags As told earlier, HTML is a markup language and makes use of various tags to format the content. These tags are enclosed within angle braces Except few tags, most of the tags have their corresponding closing tags. For example, has its closing
tag and tag has its closing tag tag etc. Above example of HTML document uses the following tags ? Sr.No Tag & Description 1 This tag defines the document type and HTML version. 2 This tag encloses the complete HTML document and mainly comprises of
document header which is represented by ... and document body which is represented by ... tags. 3 This tag represents the document's header which can keep other HTML tags like html,head,body,title,...etc
<ol>
<li>2</li>
<li>2</li>
<li>3</li>
</ol>
<ul>
<li>sdfsdf</li>
<li>s</li>
<li>dfsd</li>
<li>f</li>
<li>sd</li>
<li>f</li>
<li>sd</li>
</ul>
<p></p>
<p><iframe width="1019px" height="311px" src="//www.youtube.com/embed/uCg2BoKiuOM" frameborder="0" allowfullscreen=""></iframe></p>
<p></p>
<p></p>
<p></p>
<p></p>
<p></p>
<p></p>
<p></p>
<p></p>
<p></p>
<p></p>'''
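# Parse the HTML and print only the text content of the first table.
soup = BeautifulSoup(html_know, 'html.parser')
tags = soup.find_all('table')
print(tags[0].text)
# Note: this indexes the raw string, so it prints a single character, not a tag.
print(html_know[3])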
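The idea behind this is that sometimes I want to remove certain tags, and at other times I want to remove different ones.
Could you give me some ideas for doing this without hard-coding everything?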
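One possible direction, offered as a minimal sketch rather than a definitive answer: BeautifulSoup's `unwrap()` removes a tag while leaving its contents in place, so the set of tags to strip can be passed in as data instead of being hard-coded. The function name `strip_tags`, its `children` parameter, and the `sample` string below are hypothetical names made up for this illustration. The `children` mapping is an assumed way to express "remove li only when it sits directly inside ol".

```python
from bs4 import BeautifulSoup


def strip_tags(html, tags, children=None):
    """Remove the named tags but keep the text they contain.

    tags:     tag names to unwrap wherever they occur.
    children: optional mapping {parent_name: [child_names]}; those children
              are unwrapped only when they are direct children of that parent.
    (Hypothetical helper for illustration, not part of BeautifulSoup.)
    """
    soup = BeautifulSoup(html, "html.parser")
    children = children or {}
    for name in tags:
        # find_all returns a materialized list, so modifying the tree
        # while looping over it is safe here.
        for element in soup.find_all(name):
            for child_name in children.get(name, ()):
                # recursive=False limits the search to direct children,
                # so <li> inside a nested or sibling <ul> is left alone.
                for child in element.find_all(child_name, recursive=False):
                    child.unwrap()
            # unwrap() replaces the tag with its own contents.
            element.unwrap()
    return str(soup)


# Made-up sample: the <ol> and its <li> tags go, the <ul> stays intact.
sample = (
    "<p>Intro</p>\n"
    "<ol>\n<li>one</li>\n<li>two</li>\n</ol>\n"
    "<ul>\n<li>a</li>\n<li>b</li>\n</ul>"
)
print(strip_tags(sample, ["ol"], {"ol": ["li"]}))
```

Calling `strip_tags(sample, ["ul"], {"ul": ["li"]})` would do the opposite, and a plain list such as `["table", "tbody", "tr", "th", "td"]` would flatten a table to its text. Note that unwrapping does not add separators, so text from adjacent tags can run together unless the source already has whitespace between them.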