So I have an HTML document that looks like this:
<title>Speaker Name: Title of Talk | Subtitle | website.com</title>
... [Other Stuff]
<div class='meta'><span class='meta__item'>
Posted
<span class='meta__val'>
Jun 2006
</span></span><span class='meta__row'>
Rated
<span class='meta__val'>
Funny, Informative
</span></span></div>
<div class='talk-article__body talk-transcript__body'> TEXT
<data class='talk-transcript__para__time'>15:57</data>
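I have 2,200 of these files, and I want to put them all into a single CSV file with the columns AUTHOR, TITLE, DATE, LENGTH, and TEXT. What I have so far isn't the prettiest code, but it works: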
from bs4 import BeautifulSoup as soup
soup = soup(open(file).read(), "lxml")
at = soup.find("title").text
author = at[0:at.find(':')]
title = at[at.find(":")+1 : at.find("|") ]
text = soup.find("div", attrs={ "class" : "talk-article__body"}) # still needs cleaning
date = None    # this is the part I can't work out
length = None  # this one too
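I can't for the life of me figure out how to get the date: I suspect it takes some combination of soup and re, but I admit I can't get my head around how to combine them.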
The trick with the length is that what I want is the last time in the file at which <data class='talk-transcript__para__time'> occurs, and to grab that value.
Answer 0 (score: 2)
You can try this:
import re

# The date is the first meta__val span whose text looks like "Jun 2006".
date_spans = soup.find_all('span', {'class' : 'meta__val'})
date = [x.get_text().strip("\n\r") for x in date_spans if re.search(r"(?s)[A-Z][a-z]{2}\s+\d{4}", x.get_text().strip("\n\r"))][0]
print(date)
#date = re.findall(r"(?s)<span class=.*?>\s*([A-Z][a-z]{2}\s+\d{4})", str(soup))

# The length is the last timestamp in the transcript, hence the [-1].
length_data = soup.find_all('data', {'class' : 'talk-transcript__para__time'})
length = [x.get_text().strip("\n\r") for x in length_data if re.search(r"(?s)\d{2}:\d{2}", x.get_text().strip("\n\r"))][-1]
print(length)
#length = re.findall(r"(?s).*<data class=.*?>(.*)</data>", str(soup))
Output:
Jun 2006
15:57
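The commented-out lines above hint at a regex-only route. Here is a minimal sketch of that idea, assuming the markup matches the sample in the question exactly (single-quoted class attributes, no extra attributes, timestamps in MM:SS form). The names `html`, `DATE_RE`, `TIME_RE`, `extract_date`, and `extract_length` are illustrative, and the extra `01:12` timestamp was added to the sample only to show that the last one is picked. Running regexes over raw HTML is more brittle than going through the parser, so treat this as a fallback rather than a recommendation:

```python
import re

# Sample markup based on the question; in practice this would be the file contents.
# The 01:12 timestamp is an addition for illustration only.
html = """<div class='meta'><span class='meta__item'>
Posted
<span class='meta__val'>
Jun 2006
</span></span><span class='meta__row'>
Rated
<span class='meta__val'>
Funny, Informative
</span></span></div>
<div class='talk-article__body talk-transcript__body'> TEXT
<data class='talk-transcript__para__time'>01:12</data> MORE TEXT
<data class='talk-transcript__para__time'>15:57</data>"""

# "Mon YYYY" inside a meta__val span, allowing surrounding whitespace.
DATE_RE = re.compile(r"<span class='meta__val'>\s*([A-Z][a-z]{2}\s+\d{4})\s*</span>")
# A MM:SS timestamp inside a transcript <data> element.
TIME_RE = re.compile(r"<data class='talk-transcript__para__time'>\s*(\d{1,2}:\d{2})\s*</data>")

def extract_date(html):
    """Return the first 'Mon YYYY' value found in a meta__val span, or None."""
    match = DATE_RE.search(html)
    return match.group(1) if match else None

def extract_length(html):
    """Return the last timestamp in the transcript, or None if there is none."""
    times = TIME_RE.findall(html)
    return times[-1] if times else None

print(extract_date(html))
print(extract_length(html))
```

Unlike indexing a list comprehension with `[0]` or `[-1]`, these helpers return `None` instead of raising `IndexError` when a file has no date or no timestamps, which matters when looping over a couple of thousand files.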
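Answer 1 (score: 2)
If the first metadata value is always the date, you don't need a regex for the date, and you certainly don't need one for the time, because you can select it by the class name talk-transcript__para__time: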
from bs4 import BeautifulSoup
h = """<title>Speaker Name: Title of Talk | Subtitle | website.com</title>
<div class='meta'><span class='meta__item'>
Posted
<span class='meta__val'>
Jun 2006
</span></span><span class='meta__row'>
Rated
<span class='meta__val'>
Funny, Informative
</span></span></div>
<div class='talk-article__body talk-transcript__body'> TEXT
<data class='talk-transcript__para__time'>15:57</data>"""
soup = BeautifulSoup(h,"html.parser")
date = soup.select_one("span.meta__val").text
time = soup.select_one("data.talk-transcript__para__time").text
print(date, time)
Output:
(u'\nJun 2006\n', u'15:57')
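Note that `select_one` returns the first match, while the question asks for the last timestamp in the file. One way to get it, sketched here with a second `<data>` element added to the sample purely for illustration, is to take the last item of `select(...)`. Passing `strip=True` to `get_text` also removes the newlines visible in the output above:

```python
from bs4 import BeautifulSoup

# Trimmed sample from the question; the 01:12 timestamp is an illustrative addition.
h = """<div class='meta'><span class='meta__item'>
Posted
<span class='meta__val'>
Jun 2006
</span></span></div>
<div class='talk-article__body talk-transcript__body'> TEXT
<data class='talk-transcript__para__time'>01:12</data> MORE TEXT
<data class='talk-transcript__para__time'>15:57</data></div>"""

soup = BeautifulSoup(h, "html.parser")

# First meta__val span, assumed (as in the answer) to hold the date.
date = soup.select_one("span.meta__val").get_text(strip=True)

# select() returns every match in document order, so [-1] is the last timestamp.
times = soup.select("data.talk-transcript__para__time")
length = times[-1].get_text(strip=True) if times else None

print(date, length)
```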
If you do want to use a regex, you can pass it to find or find_all:
import re
r = re.compile(r"(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s+\d{4}")
soup = BeautifulSoup(h, "html.parser")
# In Beautiful Soup 4.4.0 and later the `text` argument is also available as `string`.
date = soup.find("span", {"class": "meta__val"}, text=r).text.strip()
Which gives you:
'Jun 2006'
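To tie this back to the original goal of one CSV row per file, here is a rough sketch that combines the pieces. It assumes every page follows the sample layout from the question and is UTF-8 encoded. The function names `parse_talk` and `write_csv` are made up for illustration, and the TEXT column is left uncleaned (it still contains the timestamps), as in the question's own code:

```python
import csv
import re

from bs4 import BeautifulSoup

DATE_RE = re.compile(r"(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s+\d{4}")
FIELDS = ["AUTHOR", "TITLE", "DATE", "LENGTH", "TEXT"]

def parse_talk(html):
    """Extract one CSV row (as a dict) from a single talk page."""
    soup = BeautifulSoup(html, "html.parser")

    # The title looks like "Speaker Name: Title of Talk | Subtitle | website.com".
    title_tag = soup.find("title")
    full_title = title_tag.get_text() if title_tag else ""
    author, _, rest = full_title.partition(":")
    title = rest.split("|")[0]

    # `string=` needs Beautiful Soup 4.4.0+; older versions call it `text=`.
    date_tag = soup.find("span", {"class": "meta__val"}, string=DATE_RE)
    times = soup.select("data.talk-transcript__para__time")
    body = soup.find("div", {"class": "talk-article__body"})

    return {
        "AUTHOR": author.strip(),
        "TITLE": title.strip(),
        "DATE": date_tag.get_text(strip=True) if date_tag else "",
        # Last timestamp in the transcript, used as the talk length.
        "LENGTH": times[-1].get_text(strip=True) if times else "",
        # Raw transcript text; still needs cleaning.
        "TEXT": body.get_text(" ", strip=True) if body else "",
    }

def write_csv(paths, out_path):
    """Parse each HTML file in `paths` and write all rows to `out_path`."""
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.DictWriter(out, fieldnames=FIELDS)
        writer.writeheader()
        for path in paths:
            with open(path, encoding="utf-8") as f:
                writer.writerow(parse_talk(f.read()))
```

A call such as `write_csv(sorted(glob.glob("talks/*.html")), "talks.csv")` (after `import glob`) would then process the whole collection; the `talks/` directory and the output filename are placeholders. Missing fields are written as empty strings so that one malformed page does not stop the run.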