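I'm really new to Python, so I'm still trying to figure out Beautiful Soup. I'm trying to scrape a website and extract five elements that should come right after the tag my code finds.
I tried next_element, which only pulls the text of the tag I used in soup.find, and I tried next_sibling, which comes back blank.
There are a lot of "first" and "last" classes on the page, so I have to specify the text of the row I need. Here is what I'm trying to scrape: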
<li>
<ul>
<li class="first">Maintenance</li>
<li>$number1</li>
<li>$number2</li>
<li>$number3</li>
<li>$number4</li>
<li>$number5</li>
<li class="last">$linetotal</li>
</ul>
</li>
Here is what I'm trying:
import re

import requests
from bs4 import BeautifulSoup

# make and model are lists defined elsewhere
for x, y in zip(make, model):
    url = ('https://URL with variables goes here')
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
    search = requests.get(url, headers=headers)
    html = search.text
    soup = BeautifulSoup(html, 'lxml')
    search_results = soup.find('li', class_='first', text=re.compile('Maintenance'))
    try:
        d = search_results.next_element
        print(d)
    except AttributeError:
        # search_results is None when no matching tag was found
        print('pass')
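The end goal is to append the values number1 through number5 to a list, but with the code above the output is just "Maintenance". Where am I going wrong? Also, since I'm very new, I'd really appreciate any background you can provide as well.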
Answer 0 (score: 0)
Taking your example, the simplest approach is to append every li element that has no class defined to a result list:
from bs4 import BeautifulSoup
html = """ <li>
<ul>
<li class="first">Maintenance</li>
<li>$number1</li>
<li>$number2</li>
<li>$number3</li>
<li>$number4</li>
<li>$number5</li>
<li class="last">$linetotal</li>
</ul>
</li>"""
soup = BeautifulSoup(html, 'lxml')
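# The parent of the "first" item is the <ul> that holds the whole row
start = soup.find('li', class_='first').parent
result = []
for ele in start.find_all('li'):
    # Keep only the <li> tags that carry no class attribute
    if not ele.get('class'):
        result.append(ele.text)
print(result)  # ['$number1', '$number2', '$number3', '$number4', '$number5']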
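As background on why the original attempt printed only "Maintenance": Beautiful Soup's next_element walks the parse tree in document order, so the element right after the <li class="first"> tag is that tag's own text, and next_sibling returns the whitespace between two <li> tags, which looks blank. Below is a minimal sketch, assuming the same markup as in the question and using the built-in html.parser instead of lxml, that steps over those text nodes with find_next_siblings.

```python
from bs4 import BeautifulSoup

html = """<li>
<ul>
<li class="first">Maintenance</li>
<li>$number1</li>
<li>$number2</li>
<li>$number3</li>
<li>$number4</li>
<li>$number5</li>
<li class="last">$linetotal</li>
</ul>
</li>"""

soup = BeautifulSoup(html, "html.parser")
first = soup.find("li", class_="first", string="Maintenance")

# next_element is the tag's own text node, not the next <li>.
own_text = first.next_element
# next_sibling is the newline between the two <li> tags.
gap = first.next_sibling

# find_next_siblings("li") returns only <li> tags and skips text nodes;
# the class check drops the <li class="last"> total at the end.
values = [
    li.get_text(strip=True)
    for li in first.find_next_siblings("li")
    if not li.get("class")
]
print(values)
```

Anchoring on the "Maintenance" tag itself, rather than on its parent, keeps the lookup tied to the one row you care about when the page has many "first" and "last" items.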
Answer 1 (score: 0)
You can use an XPath expression with tree.xpath:
//li[@class='first' and text()='Maintenance']/following-sibling::li[not(@class)]
For example:
from lxml.html import fromstring
# To parse a live page instead (also needs "import requests"):
# url = ''
# tree = fromstring(requests.get(url).content)
h = '''
<li>
<ul>
<li class="first">Maintenance</li>
<li>$number1</li>
<li>$number2</li>
<li>$number3</li>
<li>$number4</li>
<li>$number5</li>
<li class="last">$linetotal</li>
</ul>
</li>
'''
tree = fromstring(h)
items = [item.text for item in tree.xpath("//li[@class='first' and text()='Maintenance']/following-sibling::li[not(@class)]")]
print(items)
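To get from there to the goal stated in the question (one list of values per page), the XPath lookup could be wrapped in a small function and called inside the loop. This is only a sketch: the function name maintenance_values, the pages dictionary, and the make/model strings are hypothetical stand-ins for the HTML that requests would return for each URL.

```python
from lxml.html import fromstring

XPATH = ("//li[@class='first' and text()='Maintenance']"
         "/following-sibling::li[not(@class)]")


def maintenance_values(page_html):
    """Return the text of the class-less <li> items after 'Maintenance'."""
    tree = fromstring(page_html)
    return [li.text for li in tree.xpath(XPATH)]


# Hypothetical stand-in for the downloaded pages, keyed by (make, model).
pages = {
    ("makeA", "modelA"): (
        "<ul><li class='first'>Maintenance</li>"
        "<li>$1</li><li>$2</li><li class='last'>$3</li></ul>"
    ),
    ("makeB", "modelB"): (
        "<ul><li class='first'>Fuel</li>"
        "<li>$9</li><li class='last'>$9</li></ul>"
    ),
}

rows = []
for (make, model), page_html in pages.items():
    # A page without a Maintenance row simply contributes no values.
    rows.append([make, model] + maintenance_values(page_html))

print(rows)
```

Because xpath returns an empty list when nothing matches, this version needs no try/except around the lookup.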
Answer 2 (score: 0)
Something similar to QHarr's answer, but slightly different:
h = '''
<li>
<ul>
<li class="first">Maintenance</li>
<li>$number1</li>
<li>$number2</li>
<li>$number3</li>
<li>$number4</li>
<li>$number5</li>
<li class="last">$linetotal</li>
</ul>
</li>
'''
from lxml import etree
doc = etree.fromstring(h)
# '//ul/li' skips the outer wrapper <li>, which has no class attribute either
for cost in doc.xpath('//ul/li'):
    if 'class' not in cost.attrib:
        print(cost.text)
Output:
$number1
$number2
$number3
$number4
$number5