
Is there a way to scrape the HTML of an arbitrary web page and extract only its visible text?

Time: 2020-05-26 02:29:09

Tags: python html web-crawler

My current approach uses regular expressions:

data = re.sub('[^0-9a-zA-Z\\s\\.\\,]', '', string=html).lower()
data = re.sub('<[^>]*>', '', string=html)
data = re.sub('[^ ㄱ-ㅣ가-힣]+', '', string=html)

However, digits may disappear from the result, and runs of whitespace may be too long.

If there is a better way, I would appreciate hearing about it.
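
One direction that might work, sketched here with only the standard library: let an HTML parser collect the text nodes instead of deleting characters by regex, then collapse whitespace in a single pass. The names TextExtractor and html_to_text are made up for this illustration, and skipping only script and style content is an assumption about what counts as "visible". This is a rough sketch, not a tested solution for every page.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    # Assumption: these are the only tags whose contents should be hidden.
    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped tag.
        if self._skip_depth == 0:
            self.chunks.append(data)


def html_to_text(html):
    """Return the text of an HTML string with whitespace runs collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join with a space so text from adjacent block elements does not fuse,
    # then collapse every run of whitespace into a single space.
    text = " ".join(parser.chunks)
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    sample = "<h1>Title 2020</h1>\n\n<p>가격은   100 원 &amp; more</p>"
    print(html_to_text(sample))
```

Digits, Latin letters, and Hangul all survive because nothing is filtered by character class; only markup is dropped. Joining with a space can split a word that is broken up by inline tags (for example `<b>H</b>ello`), so that choice is a trade-off. If a third-party library is acceptable, Beautiful Soup's `get_text()` is a commonly used alternative.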

0 answers:

No answers yet.