How can I find siblings in Beautiful Soup?

Asked: 2017-01-16 01:58:28

Tags: python beautifulsoup web-crawler

The following is a simplified version of the HTML.

 <html>
  ...
  <div class="info">
   <span class="time">2017.01.16</span>
  </div>
  <div class="related_group">
   <ul class="related_list">
    <li>
     <p class="info">
      <span class="time">2016.12.28</span>
     </p>
    </li>
   </ul>
  </div>
  <div class="info">
   <span class="time">2017.01.26</span>
  </div>
  <div class="related_group">
   <ul class="related_list">
    <li>
     <p class="info">
      <span class="time">2017.01.30</span>
     </p>
    </li>
   </ul>
  </div>
  ...
 </html>

This pattern repeats many times, and I want to get data like this: 2017.01.16 2017.01.26

So I used Beautiful Soup in Python.

for item in soup.find_all("span", {"class" : "time"}):
    source=source+str(item.find_all(text=True))

This code finds the dates, but it also picks up the ones I don't need: 2016.12.28 2017.01.30

To get more precise results, I tried find_next_siblings:

for item in soup.find_next_siblings("span", {"class" : "time"}):
    source=source+str(item.find_next_siblings())

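As you may have guessed, it doesn't work. I did look up the documentation and read it, but I couldn't understand it because my English is limited. If you don't mind, could you help me fix the code?

3 Answers:

Answer 0 (score: 1)

Try this: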

from bs4 import BeautifulSoup

html=""" <html>

  <div class="info">
   <span class="time">2017.01.16</span>
  </div>

  <div class="related_group">
   <ul class="related_list">
    <li>
     <p class="info">
      <span class="time">2016.12.28</span>
     </p>
    </li>
   </ul>
  </div>

  <div class="info">
   <span class="time">2017.01.26</span>
  </div>

  <div class="related_group">
   <ul class="related_list">
    <li>
     <p class="info">
      <span class="time">2017.01.30</span>
     </p>
    </li>
   </ul>
  </div>

 </html>"""

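soup = BeautifulSoup(html, "html.parser")
# Collect both kinds of div in document order: info, related_group, info, ...
s = soup.find_all('div', class_=['info', 'related_group'])
s = iter(s)

# Each loop step takes an info div; next(s) consumes the related_group div after it
for a in s:
    print(a.text.strip(), '---', next(s).text.strip())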

Output:

2017.01.16 --- 2016.12.28
2017.01.26 --- 2017.01.30
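
Since the question asks about siblings specifically, here is a minimal sketch of the same pairing done with find_next_sibling. It is called on each div.info tag rather than on the soup object, because the top-level soup object has no siblings to search. The sample HTML is a condensed copy of the one in the question, and the variable names (pairs, main_time, related_times) are only illustrative.

```python
from bs4 import BeautifulSoup

# Condensed copy of the HTML from the question
html = """
<html>
  <div class="info"><span class="time">2017.01.16</span></div>
  <div class="related_group">
    <ul class="related_list">
      <li><p class="info"><span class="time">2016.12.28</span></p></li>
    </ul>
  </div>
  <div class="info"><span class="time">2017.01.26</span></div>
  <div class="related_group">
    <ul class="related_list">
      <li><p class="info"><span class="time">2017.01.30</span></p></li>
    </ul>
  </div>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

pairs = []
for info in soup.find_all("div", class_="info"):
    # The date that belongs to this div.info itself
    main_time = info.find("span", class_="time").get_text(strip=True)

    # The next sibling tag that is a div.related_group (None if there is none)
    related = info.find_next_sibling("div", class_="related_group")
    related_times = []
    if related is not None:
        related_times = [
            span.get_text(strip=True)
            for span in related.find_all("span", class_="time")
        ]

    pairs.append((main_time, related_times))

for main_time, related_times in pairs:
    print(main_time, "---", " ".join(related_times))
```

Unlike the iterator approach, this does not rely on the two kinds of div strictly alternating; a div.info with no following related_group simply gets an empty list.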

Answer 1 (score: 0)

soup.find_all('div', class_='info')

Output:

[<div class="info"> <span class="time">2017.01.16</span> </div>, <div class="info"> <span class="time">2017.01.26</span> </div>]

The tags you want are inside these div tags.
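
To go from the matched div tags to the date strings the question asks for, one possible sketch is shown below. It assumes every div.info contains a span.time, as in the sample; the HTML is a condensed copy of the question's, and the variable name times is only illustrative.

```python
from bs4 import BeautifulSoup

# Condensed copy of the HTML from the question
html = """
<html>
  <div class="info"><span class="time">2017.01.16</span></div>
  <div class="related_group">
    <ul class="related_list">
      <li><p class="info"><span class="time">2016.12.28</span></p></li>
    </ul>
  </div>
  <div class="info"><span class="time">2017.01.26</span></div>
  <div class="related_group">
    <ul class="related_list">
      <li><p class="info"><span class="time">2017.01.30</span></p></li>
    </ul>
  </div>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# Restricting the search to <div> tags skips the <p class="info"> elements,
# which hold the dates the asker does not want.
times = [
    div.find("span", class_="time").get_text(strip=True)
    for div in soup.find_all("div", class_="info")
]

print(" ".join(times))
```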

Answer 2 (score: 0)

How about this:

times = []
items = soup.find_all('div', {"class" : "info"})
for item in items:
    tmp = item.select(".time")
    t = tmp[0].text
    times.append(t)
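
The loop above could also be written as a single CSS selector. This is only a sketch of an alternative, not part of the original answer: the child combinator in "div.info > span.time" keeps only spans that sit directly inside a div.info, and it avoids the IndexError that tmp[0] would raise if some div.info had no .time element. The HTML is a condensed copy of the question's.

```python
from bs4 import BeautifulSoup

# Condensed copy of the HTML from the question
html = """
<html>
  <div class="info"><span class="time">2017.01.16</span></div>
  <div class="related_group">
    <ul class="related_list">
      <li><p class="info"><span class="time">2016.12.28</span></p></li>
    </ul>
  </div>
  <div class="info"><span class="time">2017.01.26</span></div>
  <div class="related_group">
    <ul class="related_list">
      <li><p class="info"><span class="time">2017.01.30</span></p></li>
    </ul>
  </div>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# select() returns every element matching the CSS selector, in document order
times = [span.get_text(strip=True) for span in soup.select("div.info > span.time")]

print(times)
```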