Question

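I am trying to scrape a public Facebook group using BeautifulSoup. I am using the mobile site because it does not rely on JavaScript. The script is supposed to follow the links labelled 'More' and get the text from the p tags on those pages, but it only gets the text from the p tags on the current page. Can someone point out my problem? I am new to Python and to everything in this code.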

   from selenium import webdriver
   from selenium.webdriver.common.keys import Keys
   from selenium.common.exceptions import NoSuchElementException
   from bs4 import BeautifulSoup
   import requests
   browser = webdriver.Firefox()
   browser.get('https://mobile.facebook.com/groups/22012931789?refid=27')
   for elem in browser.find_elements_by_link_text('More'):
      page = requests.get(elem.get_attribute("href"))
      soup=BeautifulSoup(page.content,'html.parser')
      print(soup.find_all('p')[0].get_text())

Answer 1

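It is always useful to see what your script is actually doing. A quick way to do that is to print the results at a few steps during execution.

For example, using your code: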

for elem in browser.find_elements_by_link_text('More'):
    print("elem's href attribute: {}".format(elem.get_attribute("href")))

You will notice that the first href is blank. We should test for this before trying to request it:

for elem in browser.find_elements_by_link_text('More'):
    if elem.get_attribute("href"):
        print("Trying to get {}".format(elem.get_attribute("href")))
        page = requests.get(elem.get_attribute("href"))
        soup=BeautifulSoup(page.content,'html.parser')
        print(soup.find_all('p')[0].get_text())

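Note that an empty elem.get_attribute("href") returns the empty unicode string u'', but Python treats an empty string as false, which is why the if works.

With that check in place, the script works fine on my machine. Hope this helps!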
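
To see the truthiness check in isolation, without a browser, here is a minimal sketch. The helper name keep_usable_links and the example.com URLs are made up for illustration; they stand in for the values elem.get_attribute("href") might return, assuming some of them come back empty or as None.

```python
def keep_usable_links(hrefs):
    """Return only the hrefs that are truthy (drops '' and None)."""
    return [href for href in hrefs if href]


# Hypothetical href values: one blank, one missing, two real-looking links.
sample_hrefs = ["", "https://example.com/a", None, "https://example.com/b"]
usable = keep_usable_links(sample_hrefs)

for href in usable:
    print("Trying to get {}".format(href))
```

Filtering up front like this keeps the request loop free of the extra if, but it does the same thing as the check shown earlier.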
