How to extract from multiple <span> tags and group the data together using BS4?

Time: 2017-12-18 06:16:12

Tags: python html beautifulsoup

I am extracting the text inside span tags from a webpage, selecting the tags by class. At times, however, the webpage splits a single line into multiple fragments and stores them in consecutive span tags. All the child span tags have the same class name.

Following is the HTML snippet:

<p class="Paragraph SCX">
    <span class="TextRun SCX">
        <span class="NormalTextRun SCX">
            This week
        </span>
    </span>
    <span class="TextRun SCX">
        <span class="NormalTextRun SCX">
            &nbsp;(12/
        </span>
    </span>
    <span class="TextRun SCX">
        <span class="NormalTextRun SCX">
            11
        </span>
    </span>
    <span class="TextRun SCX">
        <span class="NormalTextRun SCX">
            &nbsp;- 12/1
        </span>
    </span>
    <span class="TextRun SCX">
        <span class="NormalTextRun SCX">
            7
        </span>
    </span>
    <span class="TextRun SCX">
        <span class="NormalTextRun SCX">
            ):
        </span>
    </span>
    <span class="EOP SCX">
        &nbsp;
    </span>
</p>

From the above HTML snippet, I need to extract only the innermost span data.

Python code to extract data using BS4:

for data in elem.find_all('span', class_="TextRun"):
    # First child of the inner span is the text fragment
    a = data.find('span').contents[0]
    # Drop non-breaking spaces (&nbsp;)
    a = a.string.replace(u'\xa0', '')
    print(a)
    events_parsed_thisweek.append(a)

This code prints each fragment separately, as its own entry. Required output:

This week (12/11 - 12/17):

Any idea how to combine the text of these span tags into one string? Thanks!

2 answers:

Answer 0 (score: 1)

Give this a try. Make sure the whole HTML is stored in the content variable.

from bs4 import BeautifulSoup
soup = BeautifulSoup(content,'lxml')
data = ''.join([' '.join(item.text.split()) for item in soup.select(".NormalTextRun")])
print(data)

Output:

This week(12/11- 12/17):

Answer 1 (score: 0)

You could try combining the relevant information together in a string using the join method.
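A minimal sketch of that idea, using the HTML from the question. The helper name join_text_runs and the choice of the built-in html.parser are assumptions made for illustration, not part of the original answer. Each fragment is stripped of the indentation around it, its non-breaking space is turned into a regular space, and the pieces are joined with an empty separator so the spacing comes only from the page itself.

```python
from bs4 import BeautifulSoup

# HTML from the question, condensed; the inner spans still carry
# the surrounding newlines and indentation.
HTML = """
<p class="Paragraph SCX">
    <span class="TextRun SCX"><span class="NormalTextRun SCX">
        This week
    </span></span>
    <span class="TextRun SCX"><span class="NormalTextRun SCX">
        &nbsp;(12/
    </span></span>
    <span class="TextRun SCX"><span class="NormalTextRun SCX">
        11
    </span></span>
    <span class="TextRun SCX"><span class="NormalTextRun SCX">
        &nbsp;- 12/1
    </span></span>
    <span class="TextRun SCX"><span class="NormalTextRun SCX">
        7
    </span></span>
    <span class="TextRun SCX"><span class="NormalTextRun SCX">
        ):
    </span></span>
    <span class="EOP SCX">&nbsp;</span>
</p>
"""


def join_text_runs(paragraph):
    """Join the text of all innermost NormalTextRun spans (hypothetical helper)."""
    parts = []
    for run in paragraph.find_all('span', class_='NormalTextRun'):
        # Strip only ASCII whitespace (indentation, newlines); an explicit
        # character set leaves the non-breaking space untouched.
        text = run.get_text().strip(' \t\r\n')
        # Turn &nbsp; (U+00A0) into a regular space instead of deleting it.
        parts.append(text.replace('\xa0', ' '))
    return ''.join(parts)


soup = BeautifulSoup(HTML, 'html.parser')
line = join_text_runs(soup.p)
print(line)  # This week (12/11 - 12/17):
```

Selecting on NormalTextRun skips the trailing EOP span, so its lone non-breaking space does not end up in the result.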
