How should I use BeautifulSoup to scrape the text in dd tags that sit between specific dt tags on a page?

Time: 2018-09-30 18:28:28

Tags: python html python-3.x web-scraping beautifulsoup

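I'm trying to extract text from the dd elements that sit between dt tags (the dt tags mark the different dates). I tried a really hacky approach, but it isn't reliable enough: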

timeDiv = mezzrowSource.find_all("dd", class_="orange event-date")
eventDiv = mezzrowSource.find_all("dd", class_="event")
index = 0 
for time in timeDiv:
    returnValue[timeDiv[index].text] = eventDiv[index].text.strip()
    if "8" in timeDiv[index+3].text or "4:30" in timeDiv[index+3].text:
        break
    index += 1 

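Enumerating this way usually does the job, but it sometimes pulls in extra text, including events from other dates. The source for that section is pasted below. Any ideas?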

  <dt class="purple">Sun, September 30th, 2018</dt>

    <dd class="orange event-date">4:30 PM to 7:00 PM</dd>
    <dd class="event"><a href="/events/4094-mezzrow-classical-salon-with-david-oei"
                         class="event-title">Mezzrow Classical Salon with David Oei</a>


    </dd>

    <dd class="orange event-date">8:00 PM to 10:30 PM</dd>
    <dd class="event"><a href="/events/4144-luke-sellick-ron-blake-adam-birnbaum"
                         class="event-title">Luke Sellick, Ron Blake &amp; Adam Birnbaum</a>


    </dd>

    <dd class="orange event-date">11:00 PM to 1:00 AM</dd>
    <dd class="event"><a href="/events/4099-ryo-sasaki-friends-after-hours"
                         class="event-title">Ryo Sasaki &amp; Friends &quot;After-hours&quot;</a>


    </dd>


  <dt class="purple">Mon, October 1st, 2018</dt>

    <dd class="orange event-date">8:00 PM to 10:30 PM</dd>
    <dd class="event"><a href="/events/4137-greg-ruggiero-murray-wall-steve-little"
                         class="event-title">Greg Ruggiero, Murray Wall &amp; Steve Little</a>


    </dd>

    <dd class="orange event-date">11:00 PM to 1:00 AM</dd>
    <dd class="event"><a href="/events/4174-pasquale-grasso-after-hours"
                         class="event-title">Pasquale Grasso &quot;After-hours&quot;</a>


    </dd>

The expected output is a dictionary like this: {'4:30 PM to 7:00 PM': 'Mezzrow Classical Salon with David Oei', '8:00 PM to 10:30 PM': 'Greg Ruggiero, Murray Wall & Steve Little', '11:00 PM to 1:00 AM': 'Pasquale Grasso "After-hours"'}

2 answers:

Answer 0 (score: 1)

If I understand the question correctly, you can use zip():

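from bs4 import BeautifulSoup

# html is assumed to hold the markup shown in the question
mezzrowSource = BeautifulSoup(html , 'lxml')
timeDiv = [tag.get_text() for tag in mezzrowSource.find_all("dd", class_="orange event-date")]
eventDiv = [tag.get_text().strip() for tag in mezzrowSource.find_all("dd", class_="event")]
print(dict(zip(timeDiv, eventDiv)))

Output:

{'4:30 PM to 7:00 PM': 'Mezzrow Classical Salon with David Oei', '8:00 PM to 10:30 PM': 'Greg Ruggiero, Murray Wall & Steve Little', '11:00 PM to 1:00 AM': 'Pasquale Grasso "After-hours"'}

Updated:

The elements you want to get data from are all siblings, i.e. there is no element wrapping each data set, which makes it harder to group the data the way you want. The only thing in your favor is that the element with the date comes first, then the time, then the title. The time and title can repeat. So this approach selects all the elements we want and iterates over them. On the first iteration it stores the date in a string and starts a list of (time, title) tuples. The next time it finds a date, it adds the previous date and its list of tuples to the dictionary. At the end of the iteration, it adds the final date and its list of tuples to the dictionary. It's a bit messy, but that is due to the lack of structure in the HTML.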

from bs4 import BeautifulSoup
import requests
import re
import pprint

url = 'https://www.mezzrow.com/'
r = requests.get(url)
soup = BeautifulSoup(r.text , 'lxml')
ds = soup.find_all(True, {'class': re.compile('purple|event|orange event-date')})
ret = {}
tmp = []
i = None
for d in ds:
    if d.attrs['class']==['purple']:
        if i is not None:
            ret[i] = tmp
            tmp = []
        i = (d.get_text())
    elif d.attrs['class']==['orange', 'event-date']:
        j =  d.get_text()
    elif d.attrs['class']==['event']:
        tmp.append ((j,d.get_text(strip=True)))
ret[i] = tmp
pp = pprint.PrettyPrinter(depth=6)
pp.pprint(ret)

Then select the date you need from the dict object.
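If only one date is needed, another option is to start at that date's dt and walk its following siblings until the next dt shows up. The sketch below is a minimal illustration of that idea, not a tested scraper for the live site: the `<dl>` wrapper, the trimmed markup, and the helper name `events_for_date` are all assumptions made for the example, and it uses the built-in `html.parser` backend so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the question's markup; the <dl> wrapper is an assumption.
html = """
<dl>
  <dt class="purple">Sun, September 30th, 2018</dt>
    <dd class="orange event-date">4:30 PM to 7:00 PM</dd>
    <dd class="event"><a class="event-title">Mezzrow Classical Salon with David Oei</a></dd>
    <dd class="orange event-date">8:00 PM to 10:30 PM</dd>
    <dd class="event"><a class="event-title">Luke Sellick, Ron Blake &amp; Adam Birnbaum</a></dd>
  <dt class="purple">Mon, October 1st, 2018</dt>
    <dd class="orange event-date">8:00 PM to 10:30 PM</dd>
    <dd class="event"><a class="event-title">Greg Ruggiero, Murray Wall &amp; Steve Little</a></dd>
</dl>
"""


def events_for_date(soup, date_text):
    """Hypothetical helper: return [(time, title), ...] for one date's dd tags."""
    dt = soup.find("dt", string=date_text)
    if dt is None:
        return []
    events, current_time = [], None
    # Visit the dt/dd siblings that follow, stopping at the next date heading.
    for sib in dt.find_next_siblings(["dt", "dd"]):
        if sib.name == "dt":
            break
        classes = sib.get("class", [])
        if "event-date" in classes:
            current_time = sib.get_text(strip=True)
        elif "event" in classes:
            events.append((current_time, sib.get_text(strip=True)))
    return events


soup = BeautifulSoup(html, "html.parser")
sunday = events_for_date(soup, "Sun, September 30th, 2018")
print(sunday)
```

Returning a list of tuples rather than a dict keyed by time avoids silently overwriting an entry when the same time slot appears more than once.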

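Answer 1 (score: 1)

You can visit this page to get a brand-new HTML scraping package (Java) that I wrote. In my view Java is better than Python; if you disagree, that's up to you!

Download here: http://developer.torello.directory/JavaHTML/index.html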

import Torello.HTML.*;
import Torello.Java.*;

import java.util.*;
import java.util.regex.*;
import java.io.*;

public class ScrapeDD
{
    public static void main(String[] argv) throws IOException
    {
        Pattern P = Tags.getPattern("dd", "class");
        String ddData = FileRW.loadFileToString("DDData.html");
        Vector<HTMLNode> page = HTMLPage.getPageTokens(ddData, false);
        int ddPos = -1;
        while (true)
        {
            ddPos = TagNodeFind.first(page, ddPos + 1, -1, TC.OpeningTags, "dd");
            if (ddPos == -1) break;
            Vector<HTMLNode> ddPair = TagNodeGet.firstInclusive(page, ddPos, -1, "dd");
            System.out.println("DD.class = " + Tags.getInnerTagValue((TagNode) page.elementAt(ddPos), P));
            for (HTMLNode n : ddPair)
                if (n instanceof TextNode) if (n.str.trim().length() > 0)
                    System.out.println(Escape.replaceAll(n.str));
        }
    }
}

Produces this output:
DD.class = orange event-date
4:30 PM to 7:00 PM
DD.class = event
Mezzrow Classical Salon with David Oei
DD.class = orange event-date
8:00 PM to 10:30 PM
DD.class = event
Luke Sellick, Ron Blake & Adam Birnbaum
DD.class = orange event-date
11:00 PM to 1:00 AM
DD.class = event
Ryo Sasaki & Friends "After-hours"
DD.class = orange event-date
8:00 PM to 10:30 PM
DD.class = event
Greg Ruggiero, Murray Wall & Steve Little
DD.class = orange event-date
11:00 PM to 1:00 AM
DD.class = event
Pasquale Grasso "After-hours"