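I want to search a web scrape for a string and count how many times it occurs, but only between x and y within the scrape.
In the example scrape below, can anyone tell me the simplest way to count the occurrences of SEA BASS between MAIN FISHERMAN and SECONDARY FISHERMAN?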
<p style="color: #555555;
font-family: Arial,Helvetica,sans-serif;
font-size: 12px;
line-height: 18px;">June 21, 2013 By FISH PPL Admin </small>
</div>
<!-- Post Body Copy -->
<div class="post-bodycopy clearfix"><p>MAIN FISHERMAN – </p>
<p><strong>CHAMP</strong> – Pedro 00777<br />
BAIT – LOCATION1 – 2:30 – SEA BASS (3 LBS 11/4)<br />
MULTI – LOCATION2 – 7:30 – COD (3 LBS 13/8)<br />
LURE – LOCATION5 – 3:20 – RUDD (2 LBS 6/1)</p>
<p>JOE BLOGGS <a href="url">url</a><br />
BAIT – LOCATION4 – 4:45 – ROACH (5 LBS 3/1)<br />
MULTI – LOCATION2 – 5:50 – PERCH (3 LBS 6/1)<br />
LURE – LOCATION1 – 3:45 – PIKE (2 LBS 5/1) </p>
BAIT – LOCATION1 – 2:30 – SEA BASS (3 LBS 11/4)<br />
MULTI – LOCATION1 – 3:45 – JUST THE JUDGE (3 LBS 3/1)<br />
LURE – LOCATION3 – 8:25 – SCHOOL FEES (2 LBS 7/1)</p>
<div class="post-bodycopy clearfix"><p>SECONDARY FISHERMAN – </p>
<p><strong>SPOON – <a href="url">url</a></strong><br />
BAIT – LOCATION1 – 2:30 – SEA BASS (3 LBS 11/4)<br />
MULTI – LOCATION2 – 7:30 – COD (3 LBS 7/4)<br />
LURE – LOCATION1 – 4:25 – TROUT (2 LBS 5/1)</p>
I tried the following code to do this, but to no avail.
html = website.read()
pattern_to_exclude_unwanted_data = re.compile('MAIN FISHERMAN(.*)SECONDARY FISHERMAN')
excluding_unwanted_data = re.findall(pattern_to_exclude_unwanted_data, html)
print excluding_unwanted_data("SEA BASS")
Answer 0 (score: 6)
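Do it in two steps: first extract the substring between MAIN FISHERMAN and SECONDARY FISHERMAN, then count the occurrences of SEA BASS in that substring.
Like this: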
import re
# re.DOTALL lets "." match newlines, so the match can span several lines
relevant = re.search(r"MAIN FISHERMAN(.*)SECONDARY FISHERMAN", html, re.DOTALL).group(1)
found = relevant.count("SEA BASS")
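The two steps can also be wrapped in a small helper. This is a minimal sketch assuming Python 3; the function name `count_between` and the sample string are invented for illustration and are not part of the answer. It uses a non-greedy `(.*?)` so the match stops at the first end marker, and it returns 0 when the markers are missing rather than failing on `None.group(1)`.

```python
import re

def count_between(text, start, end, needle):
    """Count occurrences of needle in the text between the start and end markers."""
    # re.escape treats the markers as literal text; re.DOTALL lets "." cross newlines.
    pattern = re.escape(start) + r"(.*?)" + re.escape(end)
    match = re.search(pattern, text, re.DOTALL)
    if match is None:
        return 0  # markers not found, or found in the wrong order
    return match.group(1).count(needle)

# Hypothetical sample, shaped like the page in the question.
sample = """MAIN FISHERMAN
BAIT - SEA BASS
MULTI - COD
BAIT - SEA BASS
SECONDARY FISHERMAN
BAIT - SEA BASS
"""
print(count_between(sample, "MAIN FISHERMAN", "SECONDARY FISHERMAN", "SEA BASS"))
```

If the page is read as bytes (for example from `urlopen(...).read()` on Python 3), it would need decoding to `str` before being passed in.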
Answer 1 (score: 4)
If you want to use 'MAIN FISHERMAN' and 'SECONDARY FISHERMAN' as markers to find the <div> elements within which to count 'SEA BASS':
import re
from bs4 import BeautifulSoup  # $ pip install beautifulsoup4
soup = BeautifulSoup(html, "html.parser")  # explicit parser avoids bs4's no-parser warning
inbetween = False
count = 0
for div in soup.find_all('div', ["post-bodycopy", "clearfix"]):
    if not inbetween:
        inbetween = div.find(text=re.compile('MAIN FISHERMAN'))  # check start
    else:  # inbetween
        inbetween = not div.find(text=re.compile('SECONDARY FISHERMAN'))  # end
    if inbetween:
        count += len(div.find_all(text=re.compile('SEA BASS')))
print(count)
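If installing BeautifulSoup is not an option, the same marker idea can be sketched with the standard library's `html.parser.HTMLParser`. This is a different technique from the answer above, not a drop-in equivalent: it tracks text nodes in document order instead of whole <div> elements. The class name `MarkerCounter` and the sample markup are assumptions made up for this example.

```python
from html.parser import HTMLParser

class MarkerCounter(HTMLParser):
    """Count a needle in text nodes seen after the start marker and before the end marker."""

    def __init__(self, start, end, needle):
        super().__init__()
        self.start, self.end, self.needle = start, end, needle
        self.inbetween = False
        self.count = 0

    def handle_data(self, data):
        # handle_data receives the text between tags, never the tags themselves.
        if self.end in data:
            self.inbetween = False
        elif self.start in data:
            self.inbetween = True
        elif self.inbetween:
            self.count += data.count(self.needle)

# Hypothetical markup, loosely modelled on the page in the question.
sample_html = (
    "<div><p>MAIN FISHERMAN</p>"
    "<p>BAIT - SEA BASS<br />MULTI - COD</p>"
    "<p>BAIT - SEA BASS</p></div>"
    "<div><p>SECONDARY FISHERMAN</p>"
    "<p>BAIT - SEA BASS</p></div>"
)
parser = MarkerCounter("MAIN FISHERMAN", "SECONDARY FISHERMAN", "SEA BASS")
parser.feed(sample_html)
parser.close()
print(parser.count)
```

Because only text nodes are inspected, a needle that appears inside a tag attribute or a comment is not counted.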
Answer 2 (score: 2)
Pseudocode (untested):
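count = 0
enabled = False
# html is the page text from the question; walk it one line at a time
for line in html.splitlines():
    if 'MAIN FISHERMAN' in line:
        enabled = True   # start counting after this marker
    elif enabled and 'SEA BASS' in line:
        count += 1       # counts matching lines, not individual occurrences
    elif 'SECONDARY FISHERMAN' in line:
        enabled = False  # stop counting at this marker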
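A variation on the loop above, as a sketch rather than a tested solution for the real page: it accepts any iterable of lines (an open file, a list, or `io.StringIO`), stops reading once the end marker is reached, and sums occurrences per line instead of counting matching lines. The function name `count_in_section` and the sample text are hypothetical.

```python
import io

def count_in_section(lines, start, end, needle):
    """Sum occurrences of needle on lines after the start marker and before the end marker."""
    count = 0
    enabled = False
    for line in lines:
        if enabled and end in line:
            break  # past the section; no need to read further
        if start in line:
            enabled = True
        elif enabled:
            count += line.count(needle)  # occurrences, not just matching lines
    return count

# io.StringIO stands in for an open file or a decoded HTTP response here.
page = io.StringIO(
    "MAIN FISHERMAN\n"
    "BAIT - SEA BASS, SEA BASS\n"
    "MULTI - COD\n"
    "SECONDARY FISHERMAN\n"
    "BAIT - SEA BASS\n"
)
print(count_in_section(page, "MAIN FISHERMAN", "SECONDARY FISHERMAN", "SEA BASS"))
```

Breaking out at the end marker means only the first MAIN FISHERMAN section is counted; if the page can contain several such sections, resetting `enabled` instead of breaking would be the closer match to the loop above.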