I want to extract post_id from the CDATA section below:
<script type='text/javascript' data-cfasync='false'>
//<![CDATA[
_SHR_SETTINGS = {"endpoints":{"local_recs_url":"https:\/\/klaudynahebda.pl\/wp-admin\/admin-ajax.php?action=shareaholic_permalink_related"},"url_components":{"year":"2018","monthnum":"06","day":"19","post_id":"21132","postname":"letnie-warsztaty-ziolowo-kosmetyczne-7-9lipiec","author":"admin"}};
//]]>
</script>
I am able to get the entire CDATA block, but I don't know what to do next.
Answer 0: (score: 1)
This may not be the most elegant solution, but it works:
from bs4 import BeautifulSoup
html = r"""
<script type='text/javascript' data-cfasync='false'>
//<![CDATA[
_SHR_SETTINGS = {"endpoints":{"local_recs_url":"https:\/\/klaudynahebda.pl\/wp-admin\/admin-ajax.php?action=shareaholic_permalink_related"},"url_components":{"year":"2018","monthnum":"06","day":"19","post_id":"21132","postname":"letnie-warsztaty-ziolowo-kosmetyczne-7-9lipiec","author":"admin"}};
//]]>
</script>
"""
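soup = BeautifulSoup(html, 'lxml')
dct = {}
for scr in soup.find_all('script'):
    # Split the script body on commas and look for the "post_id":"..." pair.
    for x in scr.text.split(','):
        if 'post_id' in x:
            # Strip the quotes, then split the pair into key and value.
            k, v = x.replace('"', '').split(':')
            dct[k] = v
print(dct['post_id'])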
Output:
21132
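Splitting on commas depends on the exact layout of the script text. Since the value assigned to _SHR_SETTINGS looks like valid JSON, another option is to decode it with the standard json module and read the key directly. The following is only a sketch under that assumption: parse_settings is a hypothetical helper name, and script_text is assumed to be the script body you already extracted.

```python
import json

# Script body copied from the question; a raw string keeps the "\/" escapes intact.
script_text = r"""
//<![CDATA[
_SHR_SETTINGS = {"endpoints":{"local_recs_url":"https:\/\/klaudynahebda.pl\/wp-admin\/admin-ajax.php?action=shareaholic_permalink_related"},"url_components":{"year":"2018","monthnum":"06","day":"19","post_id":"21132","postname":"letnie-warsztaty-ziolowo-kosmetyczne-7-9lipiec","author":"admin"}};
//]]>
"""


def parse_settings(text, marker="_SHR_SETTINGS"):
    """Decode the JSON object assigned to `marker` (hypothetical helper).

    Assumes the assigned value is valid JSON; raises ValueError otherwise.
    """
    # Find the first "{" that follows the variable name.
    start = text.index("{", text.index(marker))
    # raw_decode parses one JSON value starting at `start` and
    # ignores whatever follows it (the trailing ";" and comment).
    settings, _end = json.JSONDecoder().raw_decode(text, start)
    return settings


settings = parse_settings(script_text)
print(settings["url_components"]["post_id"])  # prints 21132
```

This also gives access to the other fields (year, postname, author) without further string handling.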
Answer 1: (score: 1)
If you only need post_id, try using a regex.
For example:
import re
s = r"""<script type='text/javascript' data-cfasync='false'>
//<![CDATA[
_SHR_SETTINGS = {"endpoints":{"local_recs_url":"https:\/\/klaudynahebda.pl\/wp-admin\/admin-ajax.php?action=shareaholic_permalink_related"},"url_components":{"year":"2018","monthnum":"06","day":"19","post_id":"21132","postname":"letnie-warsztaty-ziolowo-kosmetyczne-7-9lipiec","author":"admin"}};
//]]>
</script>"""
m = re.search(r'(?<="post_id":\")(?P<post_id>.*?)(?=\",\")', s)
if m:
    print(m.group('post_id'))
Output:
21132
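If the page might emit the JSON with spaces around the colon, a slightly more tolerant pattern may be useful. This is a sketch, not a definitive solution: extract_post_id is a hypothetical helper name, and the pattern assumes post_id is always a quoted string of digits.

```python
import re

# Assumption: post_id is a quoted, all-digit value; whitespace around ":" is allowed.
POST_ID_RE = re.compile(r'"post_id"\s*:\s*"(\d+)"')


def extract_post_id(text):
    """Return post_id as an int, or None if the pattern is not found."""
    m = POST_ID_RE.search(text)
    return int(m.group(1)) if m else None


# Shortened fragment of the settings object from the question.
sample = '{"url_components":{"day":"19","post_id":"21132","postname":"letnie-warsztaty-ziolowo-kosmetyczne-7-9lipiec"}}'
print(extract_post_id(sample))  # prints 21132
```

Returning None instead of raising lets the caller decide how to handle pages that have no post_id.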