How does this forum scraping code work?

Time: 2017-05-16 10:24:20

Tags: python web-scraping beautifulsoup

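I want to thank user Pythonista for giving me this very useful code a few months ago; it solved my problem. However, because I know little about HTML and the Beautiful Soup library, I am still confused about what the code actually does.

What role does the specific_messages data structure play in this program?

I am also confused about how the code saves the various posts, and how it checks which user wrote each post.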

import requests, pprint
from bs4 import BeautifulSoup as BS

url = "https://forums.spacebattles.com/threads/the-wizard-of-woah-and-the-impossible-methods-of-necromancy.337233/"
r = requests.get(url)
soup = BS(r.content, "html.parser")

# To find all posts from one specific user.
# Everything below this line handles posts from all users instead.
specific_messages = soup.findAll('li', {'data-author': 'The Wizard of Woah!'})

# To find every post from every user, grouped by author
posts = {}

message_container = soup.find('ol', {'id': 'messageList'})
# recursive=False: only the direct <li> children of the list, not nested ones
messages = message_container.findAll('li', recursive=False)
for message in messages:
    author = message['data-author']
    # Drop .encode("utf-8") if you only print in the shell
    # (on Python 3 it turns the text into bytes)
    content = message.find('div', {'class': 'messageContent'}).text.strip().encode("utf-8")
    if author in posts:
        posts[author].append(content)
    else:
        posts[author] = [content]
pprint.pprint(posts)

1 answer:

Answer 0 (score: 1)

specific_messages = soup.findAll('li', {'data-author': 'The Wizard of Woah!'})

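  1. soup is the BeautifulSoup object that holds the parsed HTML.
  2. findAll() is a function that finds every element in the HTML matching the arguments you pass to it.
  3. li is the tag to look for.
  4. data-author is the HTML attribute that will be searched for in each <li> tag.
  5. The Wizard of Woah! is the author's name.

So basically that line searches for all <li> tags that have a data-author attribute whose value is The Wizard of Woah!

findAll returns multiple results, so you need to iterate over them to get each one and append it to a list.

That's all.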
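To make the explanation above concrete, here is a minimal sketch that runs without touching the network. The HTML string is a made-up, simplified stand-in for the forum page (the author name Reader42 and the post texts are invented for illustration), and it assumes the real page uses the same `ol#messageList`, `li[data-author]`, and `div.messageContent` structure that the question's code relies on. It shows both the attribute filter and how the `posts` dictionary ends up holding one list of posts per author.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup standing in for the real forum page.
html = """
<ol id="messageList">
  <li data-author="The Wizard of Woah!"><div class="messageContent">Chapter 1</div></li>
  <li data-author="Reader42"><div class="messageContent">Great start!</div></li>
  <li data-author="The Wizard of Woah!"><div class="messageContent">Chapter 2</div></li>
</ol>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all (findAll is its older alias) returns every <li> whose
# data-author attribute equals the given author name.
specific_messages = soup.find_all("li", {"data-author": "The Wizard of Woah!"})

# Iterate over the results and collect the text of each matching post.
author_posts = [
    li.find("div", {"class": "messageContent"}).text.strip()
    for li in specific_messages
]

# Group every post by author: the dict key is the author name,
# the value is the list of that author's posts in page order.
posts = {}
message_container = soup.find("ol", {"id": "messageList"})
for message in message_container.find_all("li", recursive=False):
    author = message["data-author"]  # the "check the user" step
    content = message.find("div", {"class": "messageContent"}).text.strip()
    posts.setdefault(author, []).append(content)

print(author_posts)
print(posts)
```

In this sketch `specific_messages` is independent of the `posts` loop: it is just the filtered list for one author, while `posts` covers everyone. `setdefault` does the same job as the `if author in posts` / `else` branch in the question.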