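Processing pieces of data within an HTML fragment in Python

Time: 2013-07-09 18:40:43

Tags: python

I'm sure this has been asked before, but I can't find the answer anywhere...

I have a string that is basically part of an HTML page. It looks a lot like this: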

body = u'<div class="admonition warning">\n<p class="first admonition-title">Warning</p>\n<p class="last">Read all of this! ALL OF IT!</p>\n</div>\n<div class="section" id="pitfalls-and-common-mistakes">\n<h1>Pitfalls and Common Mistakes<a class="headerlink" href="#pitfalls-and-common-mistakes" title="Permalink to this headline">\xb6</a></h1>\n<p>New and old users alike can run into a pitfall. Below we outline issues that we\nsee frequently as well as explain how to resolve those issues. In the #nginx IRC\nchannel on Freenode, we see these issues frequently.</p>\n<div class="section" id="this-guide-says">\n<h2>This Guide Says<a class="headerlink" href="#this-guide-says" title="Permalink to this headline">\xb6</a></h2>\n<p>The most frequent issue we see happens when someone attempts to just copy and\npaste a configuration snippet from some other guide. Not all guides out there\nare wrong, but a scary number of them are. Even the Linode library has poor\nquality information that some Nginx community members have futily attempted to\ncorrect.</p>\n<p>The Ngx CC Docs were created and reviewed by community members that work\ndirectly with all types of Nginx users. This specific document exists only\nbecause of the volume of common and recurring issues that community members see.</p>\n</div>\n<div class="section" id="my-issue-isn-t-listed">\n<h2>My Issue Isn\'t Listed<a class="headerlink" href="#my-issue-isn-t-listed" title="Permalink to this headline">\xb6</a></h2>\n<p>You don\'t see something in here related to your specific issue. Maybe we didn\'t\npoint you here because of the exact issue you\'re experiencing. Don\'t skim this\npage and assume you were sent here for no reason. You were sent here because\nsomething you did wrong is listed here.</p>\n<p>When it comes to supporting many users on many issues, community members don\'t\nwant to support broken configurations. Fix your configuration before asking for\nhelp. Fix your configuration by reading through this. 
Don\'t just skim it.</p>\n</div>\n<div class="section" id="root-inside-location-block">\n<h2>Root inside Location Block<a class="headerlink" href="#root-inside-location-block" title="Permalink to this headline">\xb6</a></h2>\n<p>BAD</p>\n<div class="highlight-nginx"><pre>server {\n    server_name www.domain.com;\n      location / {\n          root /var/www/nginx-default/;\n          [...]\n      }\n      location /foo {\n          root /var/www/nginx-default/;\n          [...]\n      }\n      location /bar {\n          root /var/www/nginx-default/;\n          [...]\n      }\n}</pre>\n</div>\n<div class="highlight-nginx"><div class="highlight"><pre><span class="k">def</span> <span class="s">greet(name):</span>\n    <span class="s">print</span> <span class="s">&#39;Hello&#39;,</span> <span class="s">name</span>\n\n<span class="s">greet(&#39;Jack&#39;)</span>\n<span class="s">greet(&#39;Jill&#39;)</span>\n<span class="s">greet(&#39;Bob&#39;)</span>\n</pre></div>\n</div>\n'

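Anyway, that's the shortened version.

Within that block are '<div class="highlight-nginx"><pre>' and '</pre></div>', which will appear multiple times in the same page. Each time they appear, I want to manipulate the text inside them. I already have the function that I want to pass the text through. However, I can't figure out how to pull the text out, run it through the function, and put it back into the string while keeping everything else the same.

Any help is much appreciated.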

2 Answers:

Answer 0 (score: 5)

You can use an HTML parser like Beautiful Soup.

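from bs4 import BeautifulSoup
# Name the parser explicitly; html.parser leaves the fragment unwrapped.
soup = BeautifulSoup(body, 'html.parser')
for div in soup.find_all(class_='highlight-nginx'):
    # .string is None when <pre> holds nested tags, so read get_text();
    # assigning .string then replaces those nested tags with plain text.
    div.pre.string = my_function(div.pre.get_text())
body = str(soup)  # serialize the modified tree back into the string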
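
If adding a dependency is not an option, the same parse-and-rewrite idea can probably be sketched with the standard library's html.parser. This is only a minimal sketch under some assumptions: PreRewriter and rewrite_pre_blocks are hypothetical names made up for illustration, and transform stands in for your own function. It applies the function to each text chunk separately, so nested <span> tags and entities such as &#39; are written back untouched. That may not suit a function that needs to see the whole block at once.

```python
from html.parser import HTMLParser


class PreRewriter(HTMLParser):
    """Copy HTML through unchanged, except text inside a <pre> that sits
    within a <div class="highlight-nginx">. (Hypothetical helper class.)"""

    def __init__(self, transform):
        # convert_charrefs=False reports entities such as &#39; as separate
        # events, so they can be written back exactly as they appeared.
        super().__init__(convert_charrefs=False)
        self.transform = transform
        self.out = []
        self.div_depth = 0  # > 0 while inside a highlight-nginx <div>
        self.pre_depth = 0  # > 0 while inside a <pre> within that <div>

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get('class') or '').split()
        if tag == 'div' and (self.div_depth or 'highlight-nginx' in classes):
            self.div_depth += 1
        elif tag == 'pre' and self.div_depth:
            self.pre_depth += 1
        # Re-emit the start tag exactly as it was written in the source.
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied verbatim.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == 'div' and self.div_depth:
            self.div_depth -= 1
        elif tag == 'pre' and self.pre_depth:
            self.pre_depth -= 1
        self.out.append('</{}>'.format(tag))

    def handle_data(self, data):
        # Only text inside a targeted <pre> goes through the function.
        self.out.append(self.transform(data) if self.pre_depth else data)

    def handle_entityref(self, name):
        self.out.append('&{};'.format(name))

    def handle_charref(self, name):
        self.out.append('&#{};'.format(name))

    def handle_comment(self, data):
        self.out.append('<!--{}-->'.format(data))


def rewrite_pre_blocks(html_text, transform):
    """Return html_text with transform() applied to the targeted <pre> text."""
    parser = PreRewriter(transform)
    parser.feed(html_text)
    parser.close()
    return ''.join(parser.out)
```

With this in place, something like body = rewrite_pre_blocks(body, my_function) would be the call, where my_function is the function you already have.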

Answer 1 (score: 0)

What you want is re.findall() combined with a non-greedy regular expression.

Try this (note: this is untested):

import re

your_new_text = your_text = '<div class="highlight-nginx"><pre>whatever is inbetween here</pre></div><div class="highlight-nginx"><pre>some more text to change</pre></div><div class="highlight-nginx"><pre>whatever is inbetween here</pre></div>'

pre_text = '<div class="highlight-nginx"><pre>'
post_text = '</pre></div>'
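# re.escape() keeps the tag text literal; re.DOTALL lets '.' match the newlines inside <pre>.
regex = re.compile(r'{pre_text}(.*?){post_text}'.format(
    pre_text=re.escape(pre_text), post_text=re.escape(post_text)), re.DOTALL)
# Find all the matches of our regular expression above
list_of_matches = regex.findall(your_text)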

for text in list_of_matches:
    # We look for an exact match, including the pre and post tags, so we don't perform
    # the wrong substitution later on.
    old_text = '{pre_text}{old_string}{post_text}'.format(
        pre_text=pre_text,
        old_string=text,
        post_text=post_text)

    new_text = '{pre_text}{manipulated_text}{post_text}'.format(
        pre_text=pre_text,
        manipulated_text=manipulate_text(text),
        post_text=post_text)

    # We have the old strings and we now replace them with the new strings.
    your_new_text = your_new_text.replace(old_text, new_text)

print(your_new_text)
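
The find-then-replace loop above can likely be collapsed into a single pass by giving re.sub() a function as the replacement. The sketch below is one way to do that, not a tested drop-in: PRE_BLOCK and rewrite_blocks are hypothetical names, and transform stands in for the asker's own function. It assumes the <pre> comes directly after the opening <div>, so it would not match the second block in the question, where a <div class="highlight"> sits in between.

```python
import re

# Allow optional whitespace between </pre> and </div>, since the page in the
# question has a newline there. re.DOTALL lets '.' span the lines inside <pre>.
PRE_BLOCK = re.compile(
    r'(<div class="highlight-nginx"><pre>)(.*?)(</pre>\s*</div>)',
    re.DOTALL)


def rewrite_blocks(html_text, transform):
    """Return html_text with transform() applied to each matched <pre> body."""
    def replace(match):
        opening, inner, closing = match.groups()
        # The surrounding tags are written back unchanged.
        return opening + transform(inner) + closing
    return PRE_BLOCK.sub(replace, html_text)
```

Because each match is rewritten where it was found, repeated blocks with identical text do not interfere with one another, which the str.replace() loop cannot guarantee.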