Question

I want to parse a list of search results and follow only the links of the results that meet certain criteria.

Let's say the results have a structure like this:

<ul>
  <li>
    <div>
      <!-- in here there is a list of information such as: 
           height: xx  price: xx , and a link <a> to the page -->
    </div>
  </li>
  <li>
    <!-- next item -->
  .
  .
  .

I want to go through each item in the list, check it against a set of criteria (height > x, price < x), and follow the link if the item matches.

Essentially, I need to reference one tag as a child of another tag (i.e., a child of the first element).

I'm fairly sure the solution is along one of the following lines, but I don't know which library and/or method to use:

1 - Using some library, I parse the list into an object so that I can do something like this:

for item in list:
  if item['price'] < x:
    br.follow_link(item.link)

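2 - I scan the HTML response until I find the first "price" text, parse the value and test it; if it meets the criteria, I follow the link adjacent to that text in the HTML string. (In my case the link appears before the information, so I need to select the link that comes before the matching information.)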
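
Approach 2 can be sketched with the standard library's `re` module. This is only an illustration under stated assumptions: the sample markup, the `find_matching_links` name, and the thresholds are all made up for the example, and every link is assumed to appear before its `height`/`price` text inside the same `<li>`. Regular expressions are fragile against real-world HTML, so this is the brute-force option rather than a recommendation.

```python
import re

# Hypothetical sample: each link comes before the information it belongs to.
SAMPLE_HTML = """
<ul>
  <li><div><a href="a.html">A</a> height: 10 price: 20</div></li>
  <li><div><a href="b.html">B</a> height: 30 price: 40</div></li>
  <li><div><a href="c.html">C</a> height: 50 price: 60</div></li>
</ul>
"""

# Work on one <li> at a time so a link is never paired with another item's numbers.
ITEM_RE = re.compile(r"<li>(.*?)</li>", re.DOTALL)
HREF_RE = re.compile(r'<a\s[^>]*href="([^"]*)"')
HEIGHT_RE = re.compile(r"height:\s*(\d+)")
PRICE_RE = re.compile(r"price:\s*(\d+)")


def find_matching_links(html, min_height, max_price):
    """Return hrefs of items with height > min_height and price < max_price."""
    links = []
    for item in ITEM_RE.findall(html):
        href = HREF_RE.search(item)
        height = HEIGHT_RE.search(item)
        price = PRICE_RE.search(item)
        if not (href and height and price):
            continue  # skip items that lack any of the three pieces
        if int(height.group(1)) > min_height and int(price.group(1)) < max_price:
            links.append(href.group(1))
    return links


print(find_matching_links(SAMPLE_HTML, min_height=10, max_price=60))
```

Splitting on `<li>` first keeps each link tied to its own item; a single pattern run over the whole response could pair a link with the numbers of a later item.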

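I can think of some super brute-force, low-level ways of doing this, but I'm wondering whether there is a library or a mechanize method I could use. Thanks.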

Answer 1

You can use a library called BeautifulSoup. Here is an outline of what the code would look like when parsing with Beautiful Soup.

Assuming your HTML is:

<ul>
  <li>
    <div>
        height: 10  price: 20
        <a href="google.com">
    </div>
  </li>
  <li>
    <div>
        height: 30  price: 40
        <a href="facebook.com">
    </div>
  </li>
  <li>
    <div>
        height: 50  price: 60
        <a href="stackoverflow.com">
    </div>
  </li>
</ul>

The code to parse it would be:

from bs4 import BeautifulSoup

# Read the input file. I am assuming the above html is saved as test.html
with open('test.html', 'r') as htmlfile:
    html = htmlfile.read()

# Name the parser explicitly so the result does not depend on what is installed
bs = BeautifulSoup(html, 'html.parser')
links_to_follow = []

ul = bs.find('ul')
for li in ul.find_all('li'):
    # The text looks like "height: 10  price: 20", so the values are words 1 and 3
    words = li.find('div').get_text().split()
    height = int(words[1])
    price = int(words[3])
    if height > 10 and price > 20:  # I am assuming this to be the criteria
        links_to_follow.append(li.find('a').get('href'))

# Print one link per line
for link in links_to_follow:
    print(link)

This gives:

facebook.com
stackoverflow.com
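
The positional lookups in the code above (the second and fourth words of the text) depend on the exact word order. A slightly more tolerant variant, sketched here under the assumption that every value is written as `label: integer`, pulls the fields out by name instead. `parse_fields` and `matches` are hypothetical helper names, not part of BeautifulSoup; the input is the kind of string that `li.find('div').get_text()` returns.

```python
import re

# Matches pairs such as "height: 30" or "price:40".
FIELD_RE = re.compile(r"(\w+):\s*(\d+)")


def parse_fields(text):
    """Turn 'height: 30  price: 40' into {'height': 30, 'price': 40}."""
    return {label: int(value) for label, value in FIELD_RE.findall(text)}


def matches(fields, min_height=10, min_price=20):
    """Same criteria as in the answer: height > 10 and price > 20."""
    return (fields.get("height", 0) > min_height
            and fields.get("price", 0) > min_price)


# The second string has the fields in the opposite order and still parses.
for text in ("height: 10  price: 20", "price: 40 height: 30"):
    fields = parse_fields(text)
    print(fields, matches(fields))
```

Inside the loop over `li` elements, the two `int(...)` lines could then be replaced by a single `parse_fields(...)` call followed by `matches(...)`.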
