So when I write the text out myself, I can capture the date with the following pattern:
import re

text = "The event takes place from May 14-June 11, 2018"
match = re.search(r'(Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|Jun(e)?|Jul(y)?|Aug(ust)?|Sep(tember)?|Oct(ober)?|Nov(ember)?|Dec(ember)?)\s+\d{1,2}\-(Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|Jun(e)?|Jul(y)?|Aug(ust)?|Sep(tember)?|Oct(ober)?|Nov(ember)?|Dec(ember)?)\s+\d{1,2},\s+\d{4}', text).group()
print(match)
'May 14-June 11, 2018'
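Since the same long pattern gets repeated in every attempt, it could probably be built once from a shared month alternation. A minimal sketch of that idea; `MONTH` and `DATE_RANGE` are names introduced here for illustration, and the groups are made non-capturing, which should not change what the overall match returns:

```python
import re

# MONTH and DATE_RANGE are illustrative names, not part of the original snippet.
MONTH = (r'(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|'
         r'Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)')

# "<month> <day>-<month> <day>, <year>", with an ASCII hyphen between the dates.
DATE_RANGE = re.compile(MONTH + r'\s+\d{1,2}-' + MONTH + r'\s+\d{1,2},\s+\d{4}')

text = "The event takes place from May 14-June 11, 2018"
found = DATE_RANGE.search(text)
# Guard against None so a failed search does not raise AttributeError.
print(found.group() if found else None)
```

Compiling once also means the same object can be passed to `re.search` or to BeautifulSoup without retyping the expression.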
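But what I really want is to use BeautifulSoup and regex to extract the date from anywhere in an HTML page. Even though the text is definitely present in the HTML, I can't seem to reproduce the success above. I'm new to this, so maybe I'm missing something obvious.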
driver.get(url)  # load the page; get() returns None, so its result is not kept
html_source = driver.page_source
soup = BeautifulSoup(html_source, 'html.parser')
# Raw string so the backslashes reach the regex engine unchanged
date = soup.find(text=re.compile(r'(Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|Jun(e)?|Jul(y)?|Aug(ust)?|Sep(tember)?|Oct(ober)?|Nov(ember)?|Dec(ember)?)\s+\d{1,2}\-(Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|Jun(e)?|Jul(y)?|Aug(ust)?|Sep(tember)?|Oct(ober)?|Nov(ember)?|Dec(ember)?)\s+\d{1,2},\s+\d{4}'))
print(date)
'None'
I also tried:
html_source = driver.page_source
soup = BeautifulSoup(html_source, 'html.parser')
text = soup.get_text().strip()
match = re.search(r'(Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|Jun(e)?|Jul(y)?|Aug(ust)?|Sep(tember)?|Oct(ober)?|Nov(ember)?|Dec(ember)?)\s+\d{1,2}\-(Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|Jun(e)?|Jul(y)?|Aug(ust)?|Sep(tember)?|Oct(ober)?|Nov(ember)?|Dec(ember)?)\s+\d{1,2},\s+\d{4}', text)
print(match)
'None'
The HTML is:
<div class="event--date">
The event takes place from May 14-June 11, 2018 </div>
But I don't want to rely on the tag, because I'll be doing this across different domains.
When I print the extracted text:
soup = BeautifulSoup(html_source, 'html.parser')
text = soup.get_text().strip()
print(text)
'@charset "UTF-8";[ng\:cloak],[ng-cloak],[data-ng-cloak],[x-ng-cloak],.ng-cloak,.x-ng-cloak,.ng-hide:not(.ng-hide-animate){display:none !important;}ng\:form{display:block;}.ng-animate-shim{visibility:hidden;}.ng-anchor{position:absolute;}
The event takes place from May 14–June 11, 2018 '
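Looking closely at this printed output, the character between `14` and `June` does not look like the ASCII hyphen the pattern expects. A minimal sketch for checking that, using a sample string that mirrors the output above with the dash written as an explicit escape; `MONTH`, `DATE_RANGE`, and the character class `[-\u2013]` are assumptions made here about which dashes to accept, not something confirmed for other sites:

```python
import re
import unicodedata

# Mirrors the printed page text; \u2013 stands for the dash seen between the dates.
page_text = "The event takes place from May 14\u2013June 11, 2018 "

# List every non-ASCII character so look-alike punctuation becomes visible.
for ch in page_text:
    if ord(ch) > 127:
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")  # prints: U+2013 EN DASH

# Illustrative names; the month alternation is the same as in the original pattern.
MONTH = (r'(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|'
         r'Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)')

# Accept either an ASCII hyphen or an en dash between the two dates.
DATE_RANGE = re.compile(MONTH + r'\s+\d{1,2}[-\u2013]' + MONTH + r'\s+\d{1,2},\s+\d{4}')

found = DATE_RANGE.search(page_text)
print(found.group() if found else None)
```

If other pages use further dash variants, the character class would need to be widened accordingly.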