Getting the length of the sentence between tags in BeautifulSoup4

Asked: 2019-04-28 13:53:20

Tags: python beautifulsoup

I am trying to collect some statistics from a website: for example, extract each word and count the neighbouring words found in the same tag.

Input

<div class="col-xs-12">
   <p class="w50">Operating Temperature (Min.)[°C]</p>
   <p class="w50 upperC">-40</p>
</div>

should result in

TAG 1

Operating, 2 i.e. #<Temperature, (Min.)[°C]>
Temperature, 2 i.e. #<Operating, (Min.)[°C]>
(Min.)[°C], 2 i.e. #<Operating, Temperature>

TAG 2

-40, 0

This is what I ended up with, but it extracts the whole text at once:

import urllib.request

from bs4 import BeautifulSoup

url = 'https://www.rohm.com/products/wireless-communication/wireless-lan-modules/bp3580-product#'
with urllib.request.urlopen(url) as response:
    page = response.read()

soup = BeautifulSoup(page, features='lxml')

# [print(tag.name) for tag in soup.find_all()]

# Remove script and style elements so their contents are not treated as text
for script in soup(["script", "style"]):
    script.decompose()

# Tags to strip while keeping their children
invalid_tags = ['br']

for tag in invalid_tags:
    for match in soup.find_all(tag):
        match.unwrap()

# Only the direct children of the document, so each one holds the whole text
html = soup.find_all(recursive=False)

for tag in html:
    print(tag.get_text())

I tried using recursive=True, but the results contain a lot of duplicates.

1 answer:

Answer 0 (score: 1)

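This may not be exactly the result you expect, but it should at least give you a hint. I made a few changes to your code: it splits the extracted text on newlines and, for each word, prints the number of other words on the same line.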

import urllib.request

from bs4 import BeautifulSoup

url = 'https://www.rohm.com/products/wireless-communication/wireless-lan-modules/bp3580-product#'
with urllib.request.urlopen(url) as response:
    page = response.read()

soup = BeautifulSoup(page, features='lxml')

# Remove script and style elements so their contents are not treated as text
for script in soup(["script", "style"]):
    script.decompose()

invalid_tags = ['br']

for tag in invalid_tags:
    for match in soup.find_all(tag):
        match.unwrap()

html = soup.find_all(recursive=False)

# Collect the non-empty lines of the extracted text
textlist = []
for tag in html:
    text = tag.text.replace("\r", "").replace("\t", "").split("\n")
    for t in text:
        if t != '':
            textlist.append(t)

# For each line, print every word with the number of other words on that line
for tt in textlist:
    print(tt)
    words = tt.split()
    for ts in words:
        print("{}, {}".format(ts, len(words) - 1))
    print("-----------------------------")
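
If the grouping should follow tags rather than lines, one possible variation is to visit only "leaf" tags, meaning tags that contain no other tags, so that each piece of text is counted once. The sketch below is a minimal illustration under some assumptions: it runs on the sample HTML from the question rather than on the live page, it uses the built-in "html.parser" instead of lxml, and the function name neighbour_counts is made up for this example.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question (not fetched from the live page)
html = """
<div class="col-xs-12">
   <p class="w50">Operating Temperature (Min.)[°C]</p>
   <p class="w50 upperC">-40</p>
</div>
"""


def neighbour_counts(markup):
    """Return one list of (word, neighbour_count) pairs per leaf tag.

    Hypothetical helper for illustration only.
    """
    soup = BeautifulSoup(markup, "html.parser")
    results = []
    for tag in soup.find_all(True):
        # Skip tags that contain other tags, so each text is visited once
        if tag.find(True) is not None:
            continue
        words = tag.get_text().split()
        if words:
            # Every word in the tag has len(words) - 1 neighbours
            results.append([(word, len(words) - 1) for word in words])
    return results


for number, pairs in enumerate(neighbour_counts(html), start=1):
    print("TAG", number)
    for word, count in pairs:
        print("{}, {}".format(word, count))
```

Note that this skips text sitting directly inside a tag that also has child tags (mixed content), so it may miss words on pages that are structured that way.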