Question

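I am working on a data mining project in which I need to analyze how a discussion progresses within a forum thread. I am interested in extracting information such as the time of each post, statistics about the post's author (number of posts, join date, etc.), the text of the post, and so on.

However, when using standard scraping tools (such as Scrapy in Python), I need to write regular expressions to detect these fields in the HTML source of the page. Since the tags vary with the type of forum, writing the regular expressions for each forum is becoming a major problem. Is there a standard library of such regular expressions that can be used depending on the type of forum?

Or is there some other technique for extracting these fields from a forum page?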

Answer 1

I wrote config files for a few of the major forum engines. Hopefully you can decipher them and infer how to parse each forum.

For VBulletin:

enclosed_section=tag:table,attributes:id;threadslist
thread=tag:a,attributes:id;REthread_title_
list_next_page=type:next_page,attributes:anchor_text;&gt;
post=tag:div,attributes:id;REpost_message_
thread_next_page=type:next_page,attributes:anchor_text;&gt;

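enclosed_section is the element that contains the links to all threads (a table in this config).
thread is where the link to each thread can be found.
list_next_page is the link to the next page of the thread list.
post is the div that holds the post text.
thread_next_page is the link to the next page of the thread.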
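To make the format above concrete, here is a minimal sketch of how such a config line could be parsed and matched against a tag. The function names parse_rule and matches are hypothetical, and reading a leading "RE" on the attribute value as "treat this as a regex prefix" is my assumption from the examples, not something the config spells out.

```python
import re

def parse_rule(line):
    """Split one 'key=field:value,field:value' config line into a dict."""
    key, _, spec = line.strip().partition("=")
    rule = {"name": key}
    # Note: this naive split would break a reg_exp field that contains a comma.
    for field in spec.split(","):
        fname, _, fvalue = field.partition(":")
        if fname == "attributes":
            # 'attributes' holds 'attr_name;attr_value' and may be empty.
            attr, _, value = fvalue.partition(";")
            # Assumption: a leading 'RE' marks the value as a regex prefix.
            is_regex = value.startswith("RE")
            rule["attr"] = attr
            rule["value"] = value[2:] if is_regex else value
            rule["is_regex"] = is_regex
        else:
            rule[fname] = fvalue
    return rule

def matches(rule, tag, attrs):
    """Return True if an HTML tag with the given attribute dict fits the rule."""
    if "tag" in rule and rule["tag"] != tag:
        return False
    attr = rule.get("attr")
    if not attr:
        return True
    actual = attrs.get(attr, "")
    if rule["is_regex"]:
        return re.match(rule["value"], actual) is not None
    return actual == rule["value"]

rule = parse_rule("post=tag:div,attributes:id;REpost_message_")
print(matches(rule, "div", {"id": "post_message_123"}))
```

Rules of type next_page use anchor_text, which is not a real HTML attribute, so they would need separate handling that compares the link text instead.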

For Invision:

enclosed_section=tag:table,attributes:id;forum_table
thread=tag:a,attributes:class;topic_title
list_next_page=tag:a,attributes:rel;next,inside_tag_attribute:href
post=tag:div,attributes:class;post entry-content |
thread_next_page=tag:a,attributes:rel;next,inside_tag_attribute:href
post_count_section=tag:td,attributes:class;stats
post_count=tag:li,attributes:,reg_exp:(\d+) Repl
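As an illustration of the post_count_section and post_count rules, the sketch below finds the li elements inside td class="stats" and applies the (\d+) Repl pattern to their text. It uses only the standard library; the class name StatsExtractor, the function extract_reply_count, and the sample HTML are all made up for the example and are not taken from a real Invision page.

```python
import re
from html.parser import HTMLParser

class StatsExtractor(HTMLParser):
    """Collect the text of <li> elements nested inside <td class="stats">."""

    def __init__(self):
        super().__init__()
        self.in_section = False  # currently inside <td class="stats">
        self.in_item = False     # currently inside an <li> of that section
        self.items = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "td" and "stats" in (attrs.get("class") or "").split():
            self.in_section = True
        elif tag == "li" and self.in_section:
            self.in_item = True
            self.items.append("")

    def handle_endtag(self, tag):
        # Simplification: nested <td> elements inside the section are not tracked.
        if tag == "td":
            self.in_section = False
        elif tag == "li":
            self.in_item = False

    def handle_data(self, data):
        if self.in_item:
            self.items[-1] += data

def extract_reply_count(html):
    """Return the first number matched by '(\\d+) Repl', or None."""
    parser = StatsExtractor()
    parser.feed(html)
    for text in parser.items:
        m = re.search(r"(\d+) Repl", text)
        if m:
            return int(m.group(1))
    return None

# Hypothetical markup, shaped only to exercise the rules above.
SAMPLE = """
<table id="forum_table"><tr>
  <td><a class="topic_title" href="/topic/1">Hello</a></td>
  <td class="stats"><ul><li>12 Replies</li><li>340 Views</li></ul></td>
</tr></table>
"""
print(extract_reply_count(SAMPLE))  # prints 12
```

In a real scraper you would probably let Scrapy selectors or a similar HTML library do the element lookup and keep the regular expression only for the final text field.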

Answer 2

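You will still need to write several methods for each forum. But as Henry suggests, there are also many forums that share the same structure.

As for easily parsing the dates of forum posts, dateparser was born out of this specific requirement and will probably be of great help.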

Extracting specific fields from threads in a forum
