Finding text with BeautifulSoup in Python

Time: 2014-12-22 12:18:54

Tags: python beautifulsoup html-parsing

I have an HTML comment at the end of my source file.

<!-- FEO DEBUG OUTPUT [TextTransAttempted:RENAME_JAVASCRIPT(18), RENAME_IMAGE(7), MINIFY_JAVASCRIPT(25), (1), JAVASCRIPT_HTML5_CACHE(19), EMBED_JAVASCRIPT(1), RENAME_CSS(3), (1), IMAGE_COMPRESSION(7), RESPONSIVE_IMAGES(6), ASYNC_JAVASCRIPT(2);TextTransApplied:RENAME_JAVASCRIPT(18), RENAME_IMAGE(7), MINIFY_JAVASCRIPT(25), (1), JAVASCRIPT_HTML5_CACHE(19), EMBED_JAVASCRIPT(1), RENAME_CSS(3), (1), IMAGE_COMPRESSION(7), RESPONSIVE_IMAGES(6), ASYNC_JAVASCRIPT(2);TagTransAttempted:(8), ASYNC_JAVASCRIPT(61);TagTransFailed:ASYNC_JAVASCRIPT(42);TagTransApplied:(8), ASYNC_JAVASCRIPT(19); ] -->

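I want to check that every value in parentheses is greater than zero. For example, I want to take the value 18 from RENAME_JAVASCRIPT and check that it is greater than zero, and likewise for the rest. Since this is a comment and not part of any HTML tag, is there a way to do this with BeautifulSoup?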

1 answer:

Answer 0: (score: 0)

I would just use re:

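import re
from bs4 import BeautifulSoup

with open("/sample_html.txt") as f:
    soup = BeautifulSoup(f.read())

# This assumes the comment is the node immediately after the <html> element.
tag = soup.find("html").next_sibling
# Collect every number in parentheses and check that each one is positive.
print(all(x > 0 for x in map(int, re.findall(r"\((\d+)\)", tag))))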

    True 
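
The snippet above depends on the comment being the node right after the <html> element. As a hedged alternative, the sketch below searches for comment nodes anywhere in the document by filtering on bs4's Comment class. The helper name comment_counts, the shortened sample string, and the choice of the "html.parser" backend are assumptions made for illustration, not part of the original answer.

```python
import re
from bs4 import BeautifulSoup, Comment


def comment_counts(html):
    """Return every integer found in parentheses inside any HTML comment."""
    soup = BeautifulSoup(html, "html.parser")
    # Comment is a NavigableString subclass, so a string filter can select it.
    comments = soup.find_all(string=lambda node: isinstance(node, Comment))
    return [int(n) for c in comments for n in re.findall(r"\((\d+)\)", c)]


# Hypothetical, shortened stand-in for the real page and its debug comment.
sample = (
    "<html><body><p>hi</p></body></html>\n"
    "<!-- FEO DEBUG OUTPUT [A(18), (1);B(0); ] -->"
)
counts = comment_counts(sample)
print(counts, all(n > 0 for n in counts))
```

Because the search looks only at Comment nodes, ordinary page text that happens to contain numbers in parentheses is ignored.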

If you want to see the names:

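import re
from bs4 import BeautifulSoup

with open("/sample_html.txt") as f:
    soup = BeautifulSoup(f.read())

tag = soup.find("html").next_sibling
# Match NAME(number) pairs; entries without a name, such as "(1)", are skipped.
for ele in re.findall(r"\w+\(\d+\)", tag):
    # Pull the number out of "NAME(number)" and keep only positive values.
    if int(ele.split("(")[1].rstrip(")")) > 0:
        print(ele)
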
RENAME_JAVASCRIPT(18)
RENAME_IMAGE(7)
MINIFY_JAVASCRIPT(25)
JAVASCRIPT_HTML5_CACHE(19)
EMBED_JAVASCRIPT(1)
RENAME_CSS(3)
IMAGE_COMPRESSION(7)
RESPONSIVE_IMAGES(6)
ASYNC_JAVASCRIPT(2)
RENAME_JAVASCRIPT(18)
RENAME_IMAGE(7)
MINIFY_JAVASCRIPT(25)
JAVASCRIPT_HTML5_CACHE(19)
EMBED_JAVASCRIPT(1)
RENAME_CSS(3)
IMAGE_COMPRESSION(7)
RESPONSIVE_IMAGES(6)
ASYNC_JAVASCRIPT(2)
ASYNC_JAVASCRIPT(61)
ASYNC_JAVASCRIPT(42)
ASYNC_JAVASCRIPT(19)
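
The same names appear several times in the output above because the comment holds several sections (TextTransAttempted, TextTransApplied, and so on). A minimal sketch, assuming the comment always follows the "Section:NAME(n), ...;" layout seen in the question, groups the values by section so that a failing entry can be traced back to where it came from. The function name parse_feo_comment and the shortened sample are hypothetical.

```python
import re


def parse_feo_comment(comment):
    """Return {section: [(name, count), ...]} for a FEO debug comment string."""
    # Keep only the text between the square brackets.
    body = re.search(r"\[(.*)\]", comment, re.DOTALL)
    if body is None:
        return {}
    sections = {}
    for chunk in body.group(1).split(";"):
        section, sep, items = chunk.partition(":")
        if not sep:
            continue  # skip the blank chunk after the final ';'
        # The name may be empty, as in the unnamed "(1)" entries.
        pairs = re.findall(r"(\w*)\((\d+)\)", items)
        sections[section.strip()] = [(name, int(n)) for name, n in pairs]
    return sections


# Hypothetical, shortened version of the comment text from the question.
sample = (
    " FEO DEBUG OUTPUT [TextTransAttempted:RENAME_JAVASCRIPT(18), (1);"
    "TagTransFailed:ASYNC_JAVASCRIPT(42); ] "
)
parsed = parse_feo_comment(sample)
for section, pairs in parsed.items():
    print(section, pairs, all(n > 0 for _, n in pairs))
```

The string passed in would be the comment node itself, for example the value of tag in the answer's code.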