Question

Hello, I need to scrape a web page and extract the data-id values using a regular expression.

Here is my code:

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("https://clarity-project.info/tenders/?entity=38163425&offset=100")
bsObj = BeautifulSoup(html,"html.parser")
DataId = bsObg.findAll("data-id", {"skr":re.compile("data-id=[0-9,a-f]")})
for DataId in DataId:
    print(DataId["skr"])

When I run my program in Jupyter, I get:

HTTPError: HTTP Error 403: Forbidden

Answer 1

The server is probably blocking your request because of the default user agent. You can change it so that your client appears to the server as a web browser. For example, a Chrome User-Agent is:

Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36

To add a User-Agent, create a Request object with the URL as its first argument and pass the User-Agent in a dictionary as the keyword argument 'headers'.
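A minimal sketch of that approach, using only urllib.request. The names CHROME_UA, build_request and fetch_html are hypothetical helpers introduced for illustration, and decoding the body as UTF-8 is an assumption about the page:

```python
from urllib.request import Request, urlopen

# Chrome User-Agent string quoted in this answer.
CHROME_UA = ('Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 '
             '(KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36')


def build_request(url, user_agent=CHROME_UA):
    """Return a Request that sends a browser-like User-Agent header."""
    return Request(url, headers={'User-Agent': user_agent})


def fetch_html(url, user_agent=CHROME_UA):
    """Fetch the URL with the custom User-Agent and return the body as text."""
    with urlopen(build_request(url, user_agent)) as response:
        # Assumption: the page is UTF-8 encoded (the default for decode()).
        return response.read().decode()
```

Calling fetch_html('https://clarity-project.info/tenders/?entity=38163425&offset=100') would then return the HTML as a string that can be passed to BeautifulSoup, provided the server accepts this User-Agent.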

Answer 2

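The web server seems to require authentication before it will serve content to Python's urllib. However, it happily serves everything to wget and curl, and https://clarity-project.info/robots.txt does not seem to exist, so I assume that scraping the site is fine. Still, it might be a good idea to ask them first.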
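The robots.txt check mentioned above can also be scripted. Here is a minimal sketch using the standard library's urllib.robotparser. The helper is_allowed is a hypothetical name, and the sample rules are invented for illustration; they are not the rules of any real site:

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Made-up rules, for illustration only.
sample_rules = "User-agent: *\nDisallow: /private/\n"

print(is_allowed(sample_rules, 'my-bot', 'https://example.com/tenders/'))
print(is_allowed(sample_rules, 'my-bot', 'https://example.com/private/page'))
```

For a live site, RobotFileParser also offers set_url() and read() to download the file itself. Note that read() treats a 401 or 403 response as "everything disallowed", so a server that rejects urllib's default User-Agent may make that check look stricter than the site intends.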

As for the code, simply changing the User-Agent string to something the server likes better seems to work:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from urllib.request import urlopen, Request

request = Request(
    'https://clarity-project.info/tenders/?entity=38163425&offset=100',
    headers={
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:55.0) Gecko/20100101 Firefox/55.0'})

html = urlopen(request).read().decode()

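(Unrelated: there is another mistake in your code: bsObj ≠ bsObg.)

EDIT: added the code below in answer to a follow-up question in the comments:

You seem to need the value of the data-id attribute, no matter which tag it belongs to. The code below does just that: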

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from urllib.request import urlopen, Request
from bs4 import BeautifulSoup

url = 'https://clarity-project.info/tenders/?entity=38163425&offset=100'
# Adjacent string literals are concatenated; the trailing space keeps the two parts separated.
agent = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 '
         '(KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36')

request = Request(url, headers={'User-Agent': agent})

html = urlopen(request).read().decode()

soup = BeautifulSoup(html, 'html.parser')

tags = soup.findAll(lambda tag: tag.get('data-id', None) is not None)
for tag in tags:
    print(tag['data-id'])

The key is simply to pass a lambda expression as the argument to BeautifulSoup's findAll function.
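If a regular expression is still preferred, as in the question, something like the following sketch may work on the fetched HTML string. It assumes the data-id values are double-quoted lowercase hexadecimal strings, which is a guess because the page markup is not shown here; DATA_ID_RE, extract_data_ids and the sample markup are made up for illustration. Note that the character class in the question, [0-9,a-f], matches only a single character and also accepts a comma.

```python
import re

# Assumption: data-id values are double-quoted lowercase hexadecimal strings.
DATA_ID_RE = re.compile(r'data-id="([0-9a-f]+)"')


def extract_data_ids(html):
    """Return every data-id value found in the HTML text, in document order."""
    return DATA_ID_RE.findall(html)


# Made-up markup, for illustration only.
sample = ('<tr class="table-row" data-id="0a1b2c"><td>x</td></tr>'
          '<tr class="table-row" data-id="ff00"><td>y</td></tr>')

print(extract_data_ids(sample))  # prints ['0a1b2c', 'ff00']
```

Matching HTML with regular expressions is fragile (single quotes, extra whitespace, or different attribute formats would be missed), so the parser-based lambda approach is usually the safer choice.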

Answer 3

You can try this:

#!/usr/bin/env python

from bs4 import BeautifulSoup
import requests 

url = 'your url here'
soup = BeautifulSoup(requests.get(url).text,"html.parser")

for i in soup.find_all('tr', attrs={'class':'table-row'}):
    print('[Data id] => {}'.format(i.get('data-id')))  # print() call: the question targets Python 3

This should work!

Web scraping: HTTPError: HTTP Error 403: Forbidden, python3