Extracting untagged text with Python is not working

Time: 2017-09-21 03:40:25

Tags: python beautifulsoup

I want to use Python and Beautiful Soup to extract 1626 from the markup below. I tried the answer in "Accessing untagged text using beautifulsoup", but all I get is an empty list [].

<div class="columns">
<h1 style="line-height: .85em; margin-top: 0" class="panel-border text-primary strong">
            Laundry Dry Cleaning Equipment
            <br>

            <br>
</h1>

        1626 Total Items
<!-- br-->
<div>...</div>
</div>

How can I extract the number?

2 answers:

Answer 0: (score: 0)

You can loop over the lines of the HTML and use a regular expression to find what you need:

import bs4, re

page = """
<div class="columns">
<h1 style="line-height: .85em; margin-top: 0" class="panel-border text-primary strong">
            Laundry Dry Cleaning Equipment
            <br>

            <br>
</h1>

        1626 Total Items
    5526 Total Items
                    4426 Total Items
<!-- br-->
<div>...</div>
</div>"""

soup = bs4.BeautifulSoup(page, 'lxml')  # the 'lxml' parser must be installed

divs = soup.find_all('div', {'class': 'columns'})
div = divs[0]    # the sample contains only one matching div

divtext = str(div).split('\n')   # serialize the div back to HTML and split it into lines
for line in divtext:
    line = line.strip()

    # match lines of the form "<number> Total Items"
    match = re.match(r'^(\d+)\s*Total Items$', line)

    if match is not None:      # a match was found
        print(match.group(1))  # print the captured number
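
As a variation on the same idea, the regular expression can be handed to Beautiful Soup directly, so the div does not have to be serialized and split into lines. This is only a minimal sketch: the helper name extract_totals is hypothetical, it uses the built-in 'html.parser' instead of 'lxml', and it assumes the count is always written as digits followed by "Total Items".

```python
import re
from bs4 import BeautifulSoup

# Markup from the question.
html = """
<div class="columns">
<h1 style="line-height: .85em; margin-top: 0" class="panel-border text-primary strong">
            Laundry Dry Cleaning Equipment
            <br>

            <br>
</h1>

        1626 Total Items
<!-- br-->
<div>...</div>
</div>
"""

# Assumed format: one or more digits followed by the literal label.
pattern = re.compile(r'(\d+)\s*Total Items')


def extract_totals(markup):
    """Hypothetical helper: return every '<number> Total Items' count as an int."""
    soup = BeautifulSoup(markup, 'html.parser')
    div = soup.find('div', class_='columns')
    if div is None:
        return []
    totals = []
    # string=<compiled regex> keeps only the text nodes in which the pattern is found
    for node in div.find_all(string=pattern):
        totals.extend(int(number) for number in pattern.findall(node))
    return totals


print(extract_totals(html))
```

Because one text node can hold several lines, findall is applied to each matching node so that repeated counts are all collected.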

Answer 1: (score: 0)

I used the same approach as the link you attached in the question above.

Hope this is what you are looking for.

Code:

from bs4 import BeautifulSoup

data = """
<div class="columns">
<h1 style="line-height: .85em; margin-top: 0" class="panel-border text-primary strong">
            Laundry Dry Cleaning Equipment
            <br>

            <br>
</h1>

        1626 Total Items
<!-- br-->
<div>...</div>
</div>
"""
soup = BeautifulSoup(data, 'html.parser')
# walk every text node in the document
for i in soup.find_all(string=True):
    if "Total Items" in i:
        # drop the label and surrounding whitespace, leaving only the number
        print(i.replace('Total Items', '').strip())

Output:

1626
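
Since the wanted text sits immediately after the closing </h1> tag, another option may be to navigate to it as a sibling instead of scanning every text node. The sketch below assumes that layout holds (the text node directly follows the first h1), and the helper name total_after_heading is hypothetical.

```python
from bs4 import BeautifulSoup

# Markup from the question.
html = """
<div class="columns">
<h1 style="line-height: .85em; margin-top: 0" class="panel-border text-primary strong">
            Laundry Dry Cleaning Equipment
            <br>

            <br>
</h1>

        1626 Total Items
<!-- br-->
<div>...</div>
</div>
"""


def total_after_heading(markup):
    """Hypothetical helper: return the number that directly follows the first <h1>, or None."""
    soup = BeautifulSoup(markup, 'html.parser')
    heading = soup.find('h1')
    if heading is None:
        return None
    # next_sibling is whatever node comes right after </h1>; here it is a text node
    sibling = heading.next_sibling
    text = str(sibling).strip() if sibling is not None else ''
    first = text.split()[0] if text else ''
    # Accept the first token only if it consists purely of digits
    return int(first) if first.isdigit() else None


print(total_after_heading(html))
```

This depends on the page structure more than the text-node search does, so it returns None rather than guessing when the node after the heading does not start with a number.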