Question

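Scraping some forum posts

I want to record each post's text, along with its author and timestamp, in a CSV file.

I'm using Beautiful Soup, but admittedly I'm a beginner at both Python and web scraping. The code I have now gets the required fields, but only for the first post. I need the information for every post in the thread. I tried soup.find_all() and soup.select(), but I didn't get the desired results.

This is the code I'm using: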

from bs4 import BeautifulSoup
import urllib2 

print "Reading URL..."
url = urllib2.urlopen("http://pantip.com/topic/35647305")
content = url.read()
soup = BeautifulSoup(content, "html.parser")

print "Finding desired HTML..."
table = soup.select("abbr.timeago")

print "\nScraped HTML is:"
print table

text = BeautifulSoup(str(table).strip(),"html.parser").get_text().encode("utf-8").replace("\n", "")
print "\nScraped text is:\n" + text

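Any clues about what I'm doing wrong would be greatly appreciated. Suggestions on how to do this in a better, cleaner way are also welcome.

As mentioned above, I'm a beginner, so please don't mind any silly mistakes. :-)

Thanks!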

Answer 1

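The comments are rendered using an Ajax request, so they are not part of the HTML you get from the topic URL. You can call the endpoint that returns them directly: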

import requests
from bs4 import BeautifulSoup

# "tid" is the topic id taken from the thread URL
params = {"tid": "35647305",
          "type": "3"}

with requests.Session() as s:
    # Send browser-like headers and mark the request as an Ajax call
    s.headers.update({"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36",
                      "X-Requested-With": "XMLHttpRequest"})
    r = s.get("http://pantip.com/forum/topic/render_comments", params=params)
    data = r.json()  # data["comments"] contains what you want

That will give you all the data. For other threads, all you need to do is take the tid from each topic URL and update the tid in the params dict.
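As a rough sketch of that last step (Python 3), the tid can be pulled out of a topic URL with a regular expression, and the comment list can be written out with the csv module. The helper names topic_id_from_url, build_params and comments_to_csv are made up for illustration, and the code assumes each entry in data["comments"] is a dict. The JSON field names are not shown above, so the column names are left to the caller; inspect one comment first to see which keys hold the text, author and timestamp.

```python
import csv
import re


def topic_id_from_url(url):
    """Return the numeric topic id from a URL like .../topic/35647305, or None."""
    match = re.search(r"/topic/(\d+)", url)
    return match.group(1) if match else None


def build_params(url):
    """Build the query params for the render_comments request from a topic URL."""
    tid = topic_id_from_url(url)
    if tid is None:
        raise ValueError("no topic id found in %r" % url)
    return {"tid": tid, "type": "3"}


def comments_to_csv(comments, fileobj, fields):
    """Write one CSV row per comment dict, keeping only the keys in `fields`.

    `fields` is supplied by the caller because the real key names in the
    JSON response are an unknown here (assumption: comments are dicts).
    """
    writer = csv.DictWriter(fileobj, fieldnames=fields,
                            restval="", extrasaction="ignore")
    writer.writeheader()
    writer.writerows(comments)
```

Usage would look something like comments_to_csv(data["comments"], f, fields), where f is a file opened with open("posts.csv", "w", newline="", encoding="utf-8") and fields is the list of keys you found in the response.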
