How do I scrape a website correctly so that I can easily get all the data?

Time: 2021-06-21 16:46:14

Tags: web web-scraping

I am new to web scraping. I am trying to get some pub ratings (pub_ratings), and I would also like to get as much data as possible from the Yelp page.

Here is my code:

from time import sleep

import requests
from bs4 import BeautifulSoup

pub_ratings = []
pub_reviews = []
pub_names = []
num_reviews = []

#for loop for all pages

for i in range(0,240,10):       
    url = "https://www.yelp.ie/search?find_desc=Pubs+%26+Bars&find_loc=london&ns=1&start={}".format(i)
    r = requests.get(url)
    soup_240 = BeautifulSoup(r.content, 'html.parser')
    sleep(1)
    
    all_data = soup_240.findAll('div', class_="container__09f24__21w3G hoverable__09f24__2nTf3 margin-t3__09f24__5bM2Z margin-b3__09f24__1DQ9x padding-t3__09f24__-R_5x padding-r3__09f24__1pBFG padding-b3__09f24__1vW6j padding-l3__09f24__1yCJf border--top__09f24__8W8ca border--right__09f24__1u7Gt border--bottom__09f24__xdij8 border--left__09f24__rwKIa border-color--default__09f24__1eOdn")



#filling them with data

    for data in all_data:
        
        pub_names.append(data.find('a', class_='css-166la90').get_text(separator=' '))  
        num_reviews.append(data.find('span',class_='reviewCount__09f24__EUXPN css-e81eai').get_text(separator=' '))
        pub_ratings.append(data.find('div', aria_label="").get_text(separator=' '))

This is my error:

AttributeError: 'NoneType' object has no attribute 'get_text'
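
This error means that at least one of the `find(...)` calls returned `None`, because no tag matched in that result card, and `.get_text()` was then called on `None`. Which lookup failed cannot be told from the traceback line alone. Note also that the HTML attribute is `aria-label` with a hyphen, which cannot be written as a Python keyword argument; `aria_label=` refers to a differently named attribute. A minimal sketch of guarding each lookup follows. The markup, the class names `card` and `name`, and the helper `text_or_none` are hypothetical stand-ins, not Yelp's real structure.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in markup; Yelp's real class names differ and change often.
html = """
<div class="card"><a class="name">Pub A</a><div aria-label="4.5 star rating"></div></div>
<div class="card"><a class="name">Pub B</a></div>
"""

soup = BeautifulSoup(html, "html.parser")


def text_or_none(tag):
    """Return the tag's text, or None when find() matched nothing."""
    return tag.get_text(separator=" ", strip=True) if tag is not None else None


rows = []
for card in soup.find_all("div", class_="card"):
    name = text_or_none(card.find("a", class_="name"))
    # A hyphenated attribute name has to go through attrs;
    # True matches any tag that has the attribute at all.
    rating_tag = card.find("div", attrs={"aria-label": True})
    rating = rating_tag["aria-label"] if rating_tag is not None else None
    rows.append((name, rating))

print(rows)
```

With a guard like this, a card that lacks one field produces `None` for that field instead of aborting the whole loop.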

1 Answer:

Answer 0 (score: 0)

The data is embedded in the page as JSON. To parse it, you can use the following example:

import json
import requests
from bs4 import BeautifulSoup

url = "https://www.yelp.ie/search?find_desc=Pubs+%26+Bars&find_loc=london&ns=1"

soup = BeautifulSoup(requests.get(url).content, "html.parser")
data = BeautifulSoup(
    soup.select_one('script[type="application/json"]').contents[0],
    "html.parser",
).contents[0]
data = json.loads(data)

# uncomment to print all data:
# print(json.dumps(data, indent=4))


def search_biz(d):
    if isinstance(d, dict):
        if "bizId" in d:
            yield d["searchResultBusiness"]
        else:
            for v in d.values():
                yield from search_biz(v)
    elif isinstance(d, list):
        for v in d:
            yield from search_biz(v)


for b in search_biz(data):
    print(b["name"])
    print(
        "Rating: {}\nAddress: {}\nPhone: {}\n".format(
            b["rating"], b["formattedAddress"], b["phone"]
        )
    )

Prints:

The Harp
Rating: 4.5
Address: 47 Chandos Place
Phone: 020 7836 0291

Cahoots Bar
Rating: 4.5
Address: 13 Kingly Court
Phone: 020 7352 6200

The Monkey Puzzle
Rating: 4.5
Address: 30 Southwick Street
Phone: 020 7723 0143

The Crobar
Rating: 4.5
Address: 17 Manette Street
Phone: 020 7439 0831

The Queen’s Head
Rating: 4
Address: 15 Denman Street
Phone: 020 7437 1540

The Queens Arms
Rating: 4.5
Address: 11 Warwick Way
Phone: 020 7834 3313

The Cauldron
Rating: 4.5
Address: 79 Stoke Newignton Road
Phone: 0117 456 2442

Coach and Horses
Rating: 4
Address: 5 Bruton Street
Phone: 020 7629 4123

The Victoria
Rating: 4.5
Address: 10a Strathearn Place
Phone: 020 7724 1191

The Ordnance
Rating: 4
Address: 29 Ordnance Hill
Phone: 020 7722 0278
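
The question loops over `start=0, 10, ..., 230` and collects values into lists. One way to combine that loop with the embedded-JSON approach might look like the sketch below. The helpers `fetch_page`, `iter_pages` and `collect` are hypothetical names introduced here. The key `"reviewCount"` is an assumption about the JSON and should be checked against the real payload. The sample dictionary is made up and only mirrors the keys `bizId` and `searchResultBusiness` used above. The network functions are defined but not called.

```python
import json
from time import sleep

import requests
from bs4 import BeautifulSoup

BASE_URL = (
    "https://www.yelp.ie/search"
    "?find_desc=Pubs+%26+Bars&find_loc=london&ns=1&start={}"
)


def search_biz(d):
    """Recursively yield every business record (same walk as in the answer)."""
    if isinstance(d, dict):
        if "bizId" in d:
            yield d["searchResultBusiness"]
        else:
            for v in d.values():
                yield from search_biz(v)
    elif isinstance(d, list):
        for v in d:
            yield from search_biz(v)


def fetch_page(start):
    """Download one results page and return its embedded JSON as a dict."""
    html = requests.get(BASE_URL.format(start)).content
    soup = BeautifulSoup(html, "html.parser")
    script = soup.select_one('script[type="application/json"]')
    raw = BeautifulSoup(script.contents[0], "html.parser").contents[0]
    return json.loads(raw)


def iter_pages(starts, delay=1.0):
    """Yield one parsed page per offset, pausing between requests."""
    for start in starts:
        yield fetch_page(start)
        sleep(delay)


def collect(pages):
    """Flatten an iterable of page dicts into one list of rows."""
    rows = []
    for data in pages:
        for b in search_biz(data):
            # .get() gives None for a missing key instead of raising KeyError.
            rows.append({
                "name": b.get("name"),
                "rating": b.get("rating"),
                "reviews": b.get("reviewCount"),  # assumed key name
            })
    return rows


# Made-up page dict, shaped only like the keys used above.
sample = {"results": [
    {"bizId": "a1", "searchResultBusiness": {"name": "Pub A", "rating": 4.5}},
    {"bizId": "b2", "searchResultBusiness": {"name": "Pub B", "rating": 4,
                                             "reviewCount": 10}},
]}
rows = collect([sample])
print(rows)

# Against the live site (not run here):
# rows = collect(iter_pages(range(0, 240, 10)))
```

Keeping the download step separate from the extraction step makes it possible to try `collect` on a saved or hand-written dictionary before sending any requests.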