Question

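I'm a complete beginner to programming and to StackOverflow. I just need to do some basic web scraping of a TripAdvisor page, clean out the useful information, and display it nicely. I'm trying to isolate the name of each cafe, the number of ratings, and the rating itself. I thought I might need to convert the page to text and use regex or something? I really don't know. What I mean is:

Output:

Coffee Cafe, 4 out of 5, 201 reviews.

Something like that. I'll put my code so far below. Any help I can get would be amazing and I'd be infinitely grateful. Cheers.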

import urllib.request

from bs4 import BeautifulSoup


def get_HTML(url):
    # Download the page and return its raw bytes.
    response = urllib.request.urlopen(url)
    html = response.read()
    return html


Tripadvisor_reviews_HTML = get_HTML(
    'https://www.tripadvisor.com.au/Restaurants-'
    'g255068-c8-Brisbane_Brisbane_Region_Queensland.html')


def get_review_count(HTML):
    # Prints every tag whose class is "reviewCount" (markup included).
    soup = BeautifulSoup(HTML, "lxml")
    for element in soup(attrs={'class': 'reviewCount'}):
        print(element)


get_review_count(Tripadvisor_reviews_HTML)


def get_review_score(HTML):
    # Only matches listings rated exactly 4.5; other scores are skipped.
    soup = BeautifulSoup(HTML, "lxml")
    for four_point_five_score in soup(attrs={'alt': '4.5 of 5 bubbles'}):
        print(four_point_five_score)


get_review_score(Tripadvisor_reviews_HTML)


def get_cafe_name(HTML):
    # Prints every tag whose class is "property_title" (markup included).
    soup = BeautifulSoup(HTML, "lxml")
    for name in soup(attrs={'class': "property_title"}):
        print(name)


get_cafe_name(Tripadvisor_reviews_HTML)

Answer 1

You forgot to use .text in each print statement. However, try the approach below to get all three fields from that site.
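
To illustrate the .text point in isolation, here is a minimal sketch on a hand-written HTML fragment. The markup and values are invented for illustration and are not copied from TripAdvisor; it also uses Python's built-in html.parser instead of lxml so that nothing extra needs installing. Printing a tag shows its full markup, whereas .text or get_text() returns only the text inside it.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment, made up for this example only.
SAMPLE_HTML = (
    '<a class="property_title" href="#"> Coffee Cafe </a>'
    '<span class="reviewCount">201 reviews</span>'
)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
title_tag = soup.find(class_="property_title")
count_tag = soup.find(class_="reviewCount")

tag_markup = str(title_tag)                  # what print(title_tag) displays
title_text = title_tag.get_text(strip=True)  # text only, whitespace trimmed
count_text = count_tag.text                  # text only, untrimmed

print(tag_markup)
print(title_text)
print(count_text)
```

get_text(strip=True) is often the more convenient of the two on scraped pages, because link text tends to carry stray whitespace and newlines.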

You could end up with something like this:

import urllib.request

from bs4 import BeautifulSoup

URL = "https://www.tripadvisor.com.au/Restaurants-g255068-c8-Brisbane_Brisbane_Region_Queensland.html"


def get_info(link):
    response = urllib.request.urlopen(link)
    soup = BeautifulSoup(response.read(), "lxml")
    # Each listing sits in its own container, so search within it.
    for items in soup.find_all(class_="shortSellDetails"):
        name = items.find(class_="property_title").get_text(strip=True)
        # The rating is read from the "alt" attribute, e.g. "4.5 of 5 bubbles".
        bubble = items.find(class_="ui_bubble_rating").get("alt")
        review = items.find(class_="reviewCount").get_text(strip=True)
        print(name, bubble, review)


if __name__ == '__main__':
    get_info(URL)
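
For the data-cleaning part of the question, one possible sketch is to run the three scraped strings through a couple of regular expressions. The helper name format_listing is hypothetical, and it assumes the rating string looks like "4.5 of 5 bubbles" and the review string like "201 reviews"; if the site uses a different wording, the patterns would need adjusting.

```python
import re


def format_listing(name, bubble, review):
    """Combine raw scraped strings into one readable line.

    Assumes bubble looks like '4.5 of 5 bubbles' and review like
    '201 reviews' (optionally with thousands separators). Returns
    None when either string is missing or does not match.
    """
    rating = re.search(r"(\d+(?:\.\d+)?)\s+of\s+(\d+)", bubble or "")
    count = re.search(r"\d[\d,]*", review or "")
    if rating is None or count is None:
        return None
    score, scale = rating.groups()
    total = int(count.group().replace(",", ""))
    return f"{name}, {score} out of {scale}, {total} reviews."


print(format_listing("Coffee Cafe", "4 of 5 bubbles", "201 reviews"))
# prints: Coffee Cafe, 4 out of 5, 201 reviews.
```

Inside the loop above, print(name, bubble, review) could then be swapped for print(format_listing(name, bubble, review)). Returning None for unmatched input keeps listings without a rating from raising an exception.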

Basic Python BeautifulSoup web scraping of TripAdvisor reviews and data cleaning
