Question

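I'm new to Python and I really want to learn more. I'm working through a task from a course I'm currently taking:

Write a small Python script that scrapes the Google Play store (https://play.google.com/store) for a specific list of apps and stores the app store listing information in an output folder.
The script should extract the following information from the app page: icon, title, description and screenshots.
I should be able to run the script with the following command: python app_fetcher.py <app_id>. The metadata should then be stored in a folder in the current directory (e.g. ./<app_id>).
Bonus points! Also fetch the app store listing's subtitle or anything else you find interesting.

I've made a start on this, but I'm not sure how to actually write the web scraping part of the script. Can anyone offer advice? I don't know which libraries to use or which functions to call. I've looked online, but everything I found involves installing other packages. Here is what I have so far; any help would be greatly appreciated!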

# Function to crawl Google Play Store and obtain data
# Note: urllib2 exists only in Python 2 (Python 3 uses urllib.request and urllib.error)
def web_crawl(app_id):
    import os, sys, urllib2
    try:
        # Build the URL for the app
        url = "https://play.google.com/store/apps/details?id=" + app_id

        # Open the URL for reading
        response = urllib2.urlopen(url)

        # Get path of py file to store txt file locally
        fpath = os.path.dirname(os.path.realpath(sys.argv[0]))

        # Open file to store app metadata (the with block closes it automatically)
        with open(os.path.join(fpath, "web_crawl.txt"), "w") as f:
            f.write("Google Play Store Web Crawler \n")
            f.write("Metadata for " + app_id + "\n")
            f.write("***************************************  \n")
            f.write("Icon: " + "\n")
            f.write("Title: " + "\n")
            f.write("Description: " + "\n")
            f.write("Screenshots: " + "\n")

            # Added subtitle
            f.write("Subtitle: " + "\n")
    except urllib2.HTTPError as e:
        print("HTTP Error: ")
        print(e.code)
    except urllib2.URLError as e:
        print("URL Error: ")
        print(e.args)

# Call web_crawl function
web_crawl("com.cmplay.tiles2")

Answer 1

I suggest you use BeautifulSoup. To start, use this code:

import requests
from bs4 import BeautifulSoup

r = requests.get("url")  # replace "url" with the app page URL
# optionally check status code here, e.g. r.raise_for_status()
soup = BeautifulSoup(r.text, "html.parser")

With the soup object, you can use selectors to extract elements from the page.
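As a rough illustration of what selector-based extraction can look like, here is a minimal sketch. It runs against a small, made-up HTML snippet rather than a real Google Play page, so the markup, the `img.screenshot` class, and the helper names `meta_content` and `extract_metadata` are all assumptions for the example. The real page structure will differ, and you would need to inspect it to find working selectors.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a downloaded app page; the real markup will differ.
SAMPLE_HTML = """
<html><head>
<title>Example App - Apps on Google Play</title>
<meta property="og:title" content="Example App">
<meta property="og:image" content="https://example.com/icon.png">
<meta name="description" content="A short example description.">
</head><body>
<img class="screenshot" src="https://example.com/shot1.png">
<img class="screenshot" src="https://example.com/shot2.png">
</body></html>
"""


def meta_content(soup, selector):
    """Return the content attribute of the first tag matching selector, or None."""
    tag = soup.select_one(selector)
    return tag.get("content") if tag is not None else None


def extract_metadata(html):
    """Collect the fields the task asks for from an HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    return {
        # These selectors are assumptions that match SAMPLE_HTML only.
        "title": meta_content(soup, 'meta[property="og:title"]'),
        "icon": meta_content(soup, 'meta[property="og:image"]'),
        "description": meta_content(soup, 'meta[name="description"]'),
        "screenshots": [img["src"] for img in soup.select("img.screenshot[src]")],
    }


metadata = extract_metadata(SAMPLE_HTML)
print(metadata["title"])
```

`select_one` returns the first match or `None`, while `select` returns a list of all matches, which is why the screenshots come back as a list and the other fields fall back to `None` when nothing is found.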

Read more here: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
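For the other part of the task, running `python app_fetcher.py <app_id>` and storing the result in `./<app_id>`, one possible arrangement is sketched below using only the standard library. The function names `parse_app_id` and `save_metadata` and the `metadata.json` file name are hypothetical choices for this example, not something the task prescribes. Downloading the icon and screenshot image files themselves is left out.

```python
import json
import os


def parse_app_id(argv):
    """Return the app id from a command line such as ['app_fetcher.py', '<app_id>']."""
    if len(argv) != 2:
        raise SystemExit("usage: python app_fetcher.py <app_id>")
    return argv[1]


def save_metadata(app_id, metadata, base_dir="."):
    """Write metadata as JSON into <base_dir>/<app_id>/ and return the file path."""
    out_dir = os.path.join(base_dir, app_id)
    # exist_ok=True lets the script be re-run for the same app without failing.
    os.makedirs(out_dir, exist_ok=True)
    out_path = os.path.join(out_dir, "metadata.json")
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(metadata, f, ensure_ascii=False, indent=2)
    return out_path
```

In the script itself you would import `sys`, call `parse_app_id(sys.argv)` to get the app id, build a dictionary of the extracted fields, and pass both to `save_metadata`.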

Create a Python web scraper to get metadata for Google Play store apps
