Extracting the text of HTML tags that contain the copyright symbol © with Python 3

Time: 2018-07-13 20:09:00

Tags: python python-3.x web-scraping beautifulsoup

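I need to check whether a web page contains the copyright symbol © and, if it does, extract the text of the tag that contains the symbol. For example, for the page "profile.theguardian.com/signin", the target text is "© 2018 Guardian News and Media Limited or its affiliated companies. All rights reserved". How can I do this with Python 3.x?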

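3 answers:

Answer 0: (score: 0)

Hi, you should post sample code when you submit a question, but the following should tell you whether the copyright sign is on a particular page: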

from bs4 import BeautifulSoup
import urllib.request

masterURL = 'https://profile.theguardian.com/signin'

# Download the page and parse it (the 'lxml' parser must be installed)
sauce = urllib.request.urlopen(masterURL).read()
soup = BeautifulSoup(sauce, 'lxml')
temp = soup.prettify().encode('UTF-8')

# b'\xc2\xa9' is the UTF-8 encoding of the copyright sign (U+00A9)
if b'\xc2\xa9' in temp:
    print('Copyright sign on page')
else:
    print('No copyright sign on page')
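
The byte check above can also be tried offline, without fetching anything. Here is a minimal sketch using only the standard library; the helper name `has_copyright_sign` and the sample inputs are made up for illustration. It assumes the page bytes are UTF-8 unless told otherwise, and it resolves entities such as `&copy;` first, because raw markup may spell the symbol that way.

```python
import html

COPYRIGHT = "\N{COPYRIGHT SIGN}"  # U+00A9


def has_copyright_sign(raw, encoding="utf-8"):
    """Return True if the raw page bytes contain a copyright sign.

    Hypothetical helper: decodes the bytes (UTF-8 assumed by default),
    resolves HTML entities such as &copy; or &#169;, then looks for U+00A9.
    """
    text = raw.decode(encoding, errors="replace")
    return COPYRIGHT in html.unescape(text)


print(has_copyright_sign(b"<p>&copy; 2018 Example Ltd.</p>"))
print(has_copyright_sign("<p>\u00a9 2018 Example Ltd.</p>".encode("utf-8")))
print(has_copyright_sign(b"<p>Nothing to see here.</p>"))
```

Note that this looks at the whole markup, so a © inside an attribute or a script would count as well. It only answers "is the symbol there?", not "which tag holds it?".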

Answer 1: (score: 0)

Since the text sits in the element with the class footer__copyright, you can do:

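from bs4 import BeautifulSoup
import urllib.request as url

masterURL = 'https://profile.theguardian.com/signin'
# select() returns a list of every <p class="footer__copyright"> element
BeautifulSoup(url.urlopen(masterURL).read(), 'html.parser').select("p.footer__copyright")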
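
A selector like this only works while the page keeps that class name, so it may be worth guarding against a miss. Here is a rough sketch on an inline HTML string; the function name `footer_copyright` and the sample markup are invented for the example and are not taken from the real page.

```python
from bs4 import BeautifulSoup


def footer_copyright(html, selector="p.footer__copyright"):
    """Return the text of the first element matching the selector, or None.

    Hypothetical helper; the default selector is the one used in this answer.
    """
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.select_one(selector)  # None when nothing matches
    return tag.get_text(strip=True) if tag is not None else None


# Made-up markup standing in for the downloaded page
sample = '<footer><p class="footer__copyright">\u00a9 2018 Example Ltd.</p></footer>'
print(footer_copyright(sample))
print(footer_copyright("<p>no footer here</p>"))
```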

Answer 2: (score: 0)

I finally found the solution I was looking for:

import re
import requests
from bs4 import BeautifulSoup

URL = 'https://profile.theguardian.com/signin'
webpage = requests.get(URL)
soup = BeautifulSoup(webpage.content, 'html.parser')
# Build the search pattern from the copyright sign (U+00A9)
pattern = re.compile('\N{COPYRIGHT SIGN}')
# Find every text node containing the symbol and print its parent tag's text
for tag in soup.find_all(string=pattern):
    copyrightTexts = tag.parent.text
    print(copyrightTexts)

Hope this helps others. Thanks to everyone who tried to help.
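
For anyone who wants to try the same idea without hitting the network, here is a small sketch that wraps it in a function and runs it on an inline HTML string. The name `copyright_texts` and the sample markup are assumptions for illustration; only the technique (search the text nodes for ©, then take the parent tag's text) comes from the answer above.

```python
import re

from bs4 import BeautifulSoup

COPYRIGHT = re.compile("\N{COPYRIGHT SIGN}")  # U+00A9


def copyright_texts(html):
    """Return the text of every tag that directly contains a copyright sign.

    Hypothetical helper built on the approach from this answer.
    """
    soup = BeautifulSoup(html, "html.parser")
    texts = []
    # Each match is a text node; its parent is the enclosing tag
    for node in soup.find_all(string=COPYRIGHT):
        texts.append(node.parent.get_text(" ", strip=True))
    return texts


# Made-up markup; html.parser turns &copy; into the actual symbol
sample = (
    "<html><body><p>Sign in</p>"
    '<footer><p class="footer__copyright">'
    "&copy; 2018 Example Ltd. All rights reserved."
    "</p></footer></body></html>"
)
print(copyright_texts(sample))
print(copyright_texts("<p>no symbol here</p>"))
```

Returning a list rather than printing inside the loop makes it easier to handle pages where the symbol shows up in more than one tag.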