Extracting only words from HTML pages

Time: 2014-12-29 05:51:13

Tags: javascript python html css python-2.7

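I'm using Python 2.7 and have a folder containing a list of HTML pages, from which I want to extract only the words. Currently, my process is to open each HTML file, run it through the Beautiful Soup library, get the text, and write it to a new file. The problem is that the output still contains JavaScript, CSS (body, color, #000000, etc.), symbols (|, `, ~, [], etc.), and random numbers.

How can I get rid of the unwanted output and get only the text?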

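import os
from bs4 import BeautifulSoup

path = "folder path"  # placeholder: the folder that holds the HTML files
raw = open(os.path.join(path, "raw.txt"), "w")
files = os.listdir(path)
for name in files:
    fname = os.path.join(path, name)
    try:
        with open(fname) as f:
            b = f.read()
            soup = BeautifulSoup(b)
            txt = soup.body.getText().encode("UTF-8")
            raw.write(txt)
    except (IOError, AttributeError):
        # The original snippet ends before its except clause; skipping
        # unreadable files and pages without a <body> is an assumption.
        continue
raw.close()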

1 Answer:

Answer 0 (score: 1)

You can remove the script and style tags before extracting the text:

import requests
from bs4 import BeautifulSoup

session = requests.session()

soup = BeautifulSoup(session.get('http://stackoverflow.com/questions/27684020/extracting-only-words-from-html-pages').text, 'html.parser')

# This part strips out the script and style tags.
for script in soup(["script", "style"]):
    script.extract()

# The parenthesized form works on both Python 2.7 and Python 3.
print(soup.get_text())
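
Removing the script and style tags takes care of the JavaScript and CSS, but the symbols and numbers mentioned in the question sit in the visible text and would still come through. The following is a minimal sketch of one way to filter them out, not a definitive solution. The helper name `extract_words`, the `WORD_RE` pattern, and the sample HTML are all hypothetical, and the pattern assumes that a "word" is a run of ASCII letters with an optional apostrophe part.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical pattern: runs of ASCII letters, optionally followed by an
# apostrophe part such as the "'t" in "don't". Non-ASCII words are not matched.
WORD_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")


def extract_words(html):
    """Return the alphabetic words found in the visible text of an HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    # Drop <script> and <style> elements so their contents never reach get_text().
    for tag in soup(["script", "style"]):
        tag.extract()
    # A space separator keeps words from adjacent elements from running together.
    text = soup.get_text(" ")
    # Keep letter-only tokens; digits and symbols such as | ~ [] are discarded.
    return WORD_RE.findall(text)


# Made-up sample page, used only to illustrate the helper.
sample = """<html><head><style>body { color: #000000; }</style></head>
<body><h1>Hello world</h1><script>var x = 42;</script>
<p>Price | 100 ~ [don't] panic</p></body></html>"""
words = extract_words(sample)
```

For the folder in the question, the same helper could be called on the contents of each file, writing `" ".join(words)` to the output file. If the pages contain non-English text, the pattern would need to be widened, since this one keeps ASCII letters only.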