Using BeautifulSoup and regex to get a date from a string, currently getting None

Time: 2018-07-31 04:08:00

Tags: python regex date beautifulsoup

When I write the text out as a literal string, I can capture the date in the following format:

import re

text = "The event takes place from May 14-June 11, 2018"
match = re.search(r'(Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|Jun(e)?|Jul(y)?|Aug(ust)?|Sep(tember)?|Oct(ober)?|Nov(ember)?|Dec(ember)?)\s+\d{1,2}\-(Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|Jun(e)?|Jul(y)?|Aug(ust)?|Sep(tember)?|Oct(ober)?|Nov(ember)?|Dec(ember)?)\s+\d{1,2},\s+\d{4}', text).group()
print(match)
'May 14-June 11, 2018'

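What I actually want is to use BeautifulSoup and regex to extract the date from anywhere on an HTML page. Even though the text is definitely present in the HTML, I can't seem to reproduce the success above. I'm new to this, so maybe I'm missing something obvious.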

import re
from bs4 import BeautifulSoup

# driver is assumed to be an existing Selenium WebDriver; url is the page address
driver.get(url)
html_source = driver.page_source
soup = BeautifulSoup(html_source, 'html.parser')
date = soup.find(text=re.compile(r'(Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|Jun(e)?|Jul(y)?|Aug(ust)?|Sep(tember)?|Oct(ober)?|Nov(ember)?|Dec(ember)?)\s+\d{1,2}\-(Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|Jun(e)?|Jul(y)?|Aug(ust)?|Sep(tember)?|Oct(ober)?|Nov(ember)?|Dec(ember)?)\s+\d{1,2},\s+\d{4}'))
print(date)
'None'

I have also tried:

html_source = driver.page_source
soup = BeautifulSoup(html_source, 'html.parser')
text = soup.get_text().strip()
match = re.search(r'(Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|Jun(e)?|Jul(y)?|Aug(ust)?|Sep(tember)?|Oct(ober)?|Nov(ember)?|Dec(ember)?)\s+\d{1,2}\-(Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|Jun(e)?|Jul(y)?|Aug(ust)?|Sep(tember)?|Oct(ober)?|Nov(ember)?|Dec(ember)?)\s+\d{1,2},\s+\d{4}', text)
print(match)
'None'

The HTML is:

<div class="event--date">
       The event takes place from May 14-June 11, 2018        </div>

But I don't want to rely on the tag, because I will be doing this across many different domains.
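Since the goal is to avoid depending on any particular tag, one option is to search the page's visible text as a whole. The following is only a minimal sketch of that idea: it uses the standard library's `html.parser` as a dependency-free stand-in for `soup.get_text()` (a swapped-in technique, not BeautifulSoup itself), and the names `TextCollector` and `find_in_page_text` are hypothetical. It skips `<style>` and `<script>` contents so that CSS and JavaScript do not end up in the searched text.

```python
import re
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect text from every element except <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def find_in_page_text(html, pattern):
    """Return every match of `pattern` in the tag-free text of `html`."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace so line breaks and indentation do not matter.
    text = " ".join(" ".join(parser.chunks).split())
    return [m.group(0) for m in re.finditer(pattern, text)]


# Hypothetical sample markup modelled on the snippet in the question.
sample_html = (
    '<style>.ng-hide{display:none}</style>'
    '<div class="event--date">\n'
    '       The event takes place from May 14-June 11, 2018        </div>'
)
print(find_in_page_text(sample_html, r"[A-Z][a-z]+ \d{1,2}-[A-Z][a-z]+ \d{1,2}, \d{4}"))
```

The short pattern in the usage line is just a placeholder; the full month-name pattern from the question could be passed in instead.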

When I print the text:

soup = BeautifulSoup(html_source, 'html.parser')
text = soup.get_text().strip()
print(text)
'@charset "UTF-8";[ng\:cloak],[ng-cloak],[data-ng-cloak],[x-ng-cloak],.ng-cloak,.x-ng-cloak,.ng-hide:not(.ng-hide-animate){display:none !important;}ng\:form{display:block;}.ng-animate-shim{visibility:hidden;}.ng-anchor{position:absolute;}


The event takes place from May 14–June 11, 2018    '
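One detail in the printed output above may be relevant: the character between "14" and "June" looks like an en dash (U+2013) rather than the ASCII hyphen that the pattern expects. The following is a minimal sketch for checking this and for matching either character. `MONTH`, `DATE_RANGE`, and `describe_non_ascii` are hypothetical names, and the sample string is an assumption that mirrors the printed text.

```python
import re
import unicodedata

# Same month alternatives as in the question, with non-capturing groups.
MONTH = (
    r"(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|"
    r"Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)"
)

# Accept either a hyphen-minus or an en dash (U+2013) between the two dates.
DATE_RANGE = re.compile(
    MONTH + r"\s+\d{1,2}[-\u2013]" + MONTH + r"\s+\d{1,2},\s+\d{4}"
)


def describe_non_ascii(text):
    """Return (character, Unicode name) for each non-ASCII character in text."""
    return [(ch, unicodedata.name(ch, "UNKNOWN")) for ch in text if ord(ch) > 127]


# Assumed sample: the page text with an en dash, as in the printed output.
page_text = "The event takes place from May 14\u2013June 11, 2018"

print(describe_non_ascii(page_text))

match = DATE_RANGE.search(page_text)
# Guard against None so a failed search does not raise AttributeError.
print(match.group(0) if match else None)
```

If `describe_non_ascii` reports an en dash for the real page text, that would explain why the hyphen-only pattern returns None there while the hand-typed string matches.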

0 answers:

No answers