I am trying to collect some statistics from a website: for example, extract each word and count the neighbouring words found in the same tag.
Input:
<div class="col-xs-12">
<p class="w50">Operating Temperature (Min.)[°C]</p>
<p class="w50 upperC">-40</p>
</div>
should result in:
TAG 1
Operating , 2 i.e #<Temperature, (Min.)[°C]>
Temperature, 2 i.e #<Operating, (Min.)[°C]>
(Min.)[°C], 2 i.e #<Operating,Temperature>
TAG 2
-40, 0
This is what I ended up with, but it extracts the entire text:
import urllib.request

from bs4 import BeautifulSoup

url = 'https://www.rohm.com/products/wireless-communication/wireless-lan-modules/bp3580-product#'
with urllib.request.urlopen(url) as response:
    page = response.read()
soup = BeautifulSoup(page, features='lxml')
# [print(tag.name) for tag in soup.find_all()]
for script in soup(["script", "style"]):
    script.decompose()  # rip it out
invalid_tags = ['br']
for tag in invalid_tags:
    for match in soup.find_all(tag):
        match.unwrap()  # replace the tag with its children
# recursive=False returns only the top-level tags
html = soup.find_all(recursive=False)
for tag in html:
    print(tag.get_text())
I tried using recursive=True, but the results contain many duplicates.
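The duplication can be reproduced on the sample fragment above. This is a minimal sketch, assuming the built-in "html.parser" backend instead of lxml so that no extra dependency is needed; the names `sample` and `texts` are only illustrative. With recursive=True, find_all() returns the <div> as well as both <p> tags, and get_text() on the <div> already contains the text of its children.

```python
from bs4 import BeautifulSoup

# The fragment from the question, inlined so the example needs no network access.
sample = """<div class="col-xs-12">
<p class="w50">Operating Temperature (Min.)[°C]</p>
<p class="w50 upperC">-40</p>
</div>"""

soup = BeautifulSoup(sample, "html.parser")

# recursive=True is the default: the <div> and both <p> tags are returned.
tags = soup.find_all()

# get_text() on a parent includes the text of all its descendants, so each
# paragraph's text shows up twice: once via the <div>, once on its own.
texts = [tag.get_text(" ", strip=True) for tag in tags]

for tag, text in zip(tags, texts):
    print(tag.name, repr(text))
```

Every additional level of nesting repeats the same text once more, which would explain the large number of duplicates on a real page.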
Answer 0 (score: 1)
This may not be exactly the result you expect, but it should at least give you a hint. I made some modifications to your code.
import urllib.request

from bs4 import BeautifulSoup

url = 'https://www.rohm.com/products/wireless-communication/wireless-lan-modules/bp3580-product#'
with urllib.request.urlopen(url) as response:
    page = response.read()
soup = BeautifulSoup(page, features='lxml')
for script in soup(["script", "style"]):
    script.decompose()  # rip it out
invalid_tags = ['br']
for tag in invalid_tags:
    for match in soup.find_all(tag):
        match.unwrap()  # replace the tag with its children
html = soup.find_all(recursive=False)
# Collect the non-empty lines of text from each top-level tag
textlist = []
for tag in html:
    text = tag.text.replace("\r", "").replace("\t", "").split("\n")
    for t in text:
        if t != '':
            textlist.append(t)
# For each line, print every word with the number of other words on that line
for tt in textlist:
    print(tt)
    for ts in tt.split():
        print("{}, {}".format(ts, len(tt.split()) - 1))
    print("-----------------------------")
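The code above groups words by line of text rather than by tag. A possible variation that groups by tag is sketched below. It is an illustration and not part of the original answer: the function name `neighbour_counts` is hypothetical, the HTML fragment from the question is inlined, and the built-in "html.parser" backend is assumed. Each tag contributes only the text that belongs to it directly, so parent containers do not repeat the words of their children.

```python
from bs4 import BeautifulSoup


def neighbour_counts(html):
    """Return one list per text-bearing tag of (word, count, neighbours) tuples."""
    soup = BeautifulSoup(html, "html.parser")
    for junk in soup(["script", "style"]):
        junk.decompose()

    results = []
    for tag in soup.find_all():
        # string=True with recursive=False yields only this tag's direct text
        # nodes, not the text of nested tags.
        own_text = " ".join(tag.find_all(string=True, recursive=False))
        words = own_text.split()
        if not words:
            continue  # the tag has no text of its own (e.g. a wrapper <div>)
        rows = []
        for i, word in enumerate(words):
            neighbours = words[:i] + words[i + 1:]
            rows.append((word, len(neighbours), neighbours))
        results.append(rows)
    return results


sample = """<div class="col-xs-12">
<p class="w50">Operating Temperature (Min.)[°C]</p>
<p class="w50 upperC">-40</p>
</div>"""

for number, rows in enumerate(neighbour_counts(sample), start=1):
    print("TAG", number)
    for word, count, neighbours in rows:
        print("{}, {} i.e. #<{}>".format(word, count, ", ".join(neighbours)))
```

Note that a tag with mixed content, such as a <p> containing a <b>, is split into two groups by this approach: the words inside <b> are counted separately from the words directly inside <p>.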