I am extracting text from span tags on a webpage, selecting them by class. Sometimes the page splits a single line into multiple fragments and stores them in consecutive span tags. All the child span tags have the same class name.
Following is the HTML snippet:
<p class="Paragraph SCX">
<span class="TextRun SCX">
<span class="NormalTextRun SCX">
This week
</span>
</span>
<span class="TextRun SCX">
<span class="NormalTextRun SCX">
(12/
</span>
</span>
<span class="TextRun SCX">
<span class="NormalTextRun SCX">
11
</span>
</span>
<span class="TextRun SCX">
<span class="NormalTextRun SCX">
- 12/1
</span>
</span>
<span class="TextRun SCX">
<span class="NormalTextRun SCX">
7
</span>
</span>
<span class="TextRun SCX">
<span class="NormalTextRun SCX">
):
</span>
</span>
<span class="EOP SCX">
</span>
</p>
From the above HTML snippet, I need to extract only the innermost span data.
Python code to extract data using BS4:
for data in elem.find_all('span', class_="TextRun"):
    a = data.find('span').contents[0]
    a = a.string.replace(u'\xa0', '')
    print(a)
    events_parsed_thisweek.append(a)
This code prints each fragment separately, as its own entry. Required output:
This week (12/11 - 12/17):
Any idea how to combine these span tag data together? Thanks!
Answer 0 (score: 1)
Give this a go. Make sure to wrap the whole HTML in the content variable.
from bs4 import BeautifulSoup
soup = BeautifulSoup(content,'lxml')
data = ''.join([' '.join(item.text.split()) for item in soup.select(".NormalTextRun")])
print(data)
Output:
This week(12/11- 12/17):
Answer 1 (score: 0)
You could try combining the relevant information together in a string using the join method.
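A minimal sketch of that idea, assuming the fragments have already been collected into a list the way the loop in the question does. The helper name join_fragments is hypothetical, and the sample list is copied from the HTML snippet in the question:

```python
def join_fragments(fragments):
    """Join text fragments into one string (hypothetical helper, not part of BeautifulSoup)."""
    # Replace non-breaking spaces, then trim the whitespace around each fragment.
    cleaned = [frag.replace('\xa0', ' ').strip() for frag in fragments]
    # Drop fragments that are empty after trimming and concatenate the rest.
    return ''.join(frag for frag in cleaned if frag)


# Fragments as they appear in the question's HTML snippet.
events_parsed_thisweek = ['\nThis week\n', '\n(12/\n', '\n11\n', '\n- 12/1\n', '\n7\n', '\n):\n']
print(join_fragments(events_parsed_thisweek))  # This week(12/11- 12/17):
```

Because each fragment is trimmed, any space that sits at a fragment boundary is lost. If the real page keeps those spaces inside the spans, dropping the .strip() call may preserve them.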