Question

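I need to implement web scraping, and this is my first time working with BeautifulSoup. I requested a URL and got results that each contain another URL, a date, and a title. Now I need to fetch the content of the URL extracted from the first result.

I take that URL and request it the same way. I need all the p tags, so I added find_all('p'):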

import urllib.request
from bs4 import BeautifulSoup

def get_inner_urlData(self, link_url):
    # Fetch the linked page and parse it
    link_page = urllib.request.urlopen(link_url)
    link_soup = BeautifulSoup(link_page, 'html.parser')
    link_content = []
    for p_tag in link_soup.find_all('p'):
        # p_tag.find('script').decompose()
        print(p_tag.replace_with())

When printed, the output shows:

<p><script> bla bla </script></p>
<p> this is a correct para</p>
<p> this is a correct para </p>

How can I skip the p tags that contain a script tag? When I tried to decompose the script tag, I got errors such as:

AttributeError: ResultSet object has no attribute 'find_all'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()

Answer 1

It is not clear where the code fails, but the usual way to remove script elements from another element is to find all the script elements inside it and call decompose() on each one.

How to exclude inner tags within a p tag
