Question update, October 4, 2016

Question

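I need to parse multiple HTML pages fetched with requests.get(). I only want to keep the content of each page and get rid of the embedded JavaScript and CSS. I have seen the following posts, but none of the solutions worked for me: http://stackoverflow.com/questions/14344476/how-to-strip-entire-html-css-and-js-code-or-tags-from-html-page-in-python, http://stackoverflow.com/questions/1936466/beautifulsoup-grab-visible-webpage-text and http://stackoverflow.com/questions/2081586/web-scraping-with-python

I have working code that does not remove the JS and CSS. Here is my code: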

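import requests
from bs4 import BeautifulSoup

# clean_urls and my_params are built earlier in the script (not shown here)
count = 1
for link in clean_urls[:2]:
    page = requests.get(link, timeout=5)
    try:
        # .text returns all text nodes, including the contents of <script> and <style>
        page = BeautifulSoup(page.content, 'html.parser').text
        # the with block closes each output file as soon as it has been written
        with open(my_params['q'] + '_' + str(count) + '.txt', 'w') as webpage_out:
            webpage_out.write(page)
        count += 1
    except Exception:
        pass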

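I tried to incorporate the solutions from the links mentioned above, but none of the code worked for me. What line of code would get rid of the embedded JS and the embedded CSS?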

Question update, October 4, 2016

I read a CSV file like this:

trump,clinton
data science, operating system
windows,linux
diabetes,cancer

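I search gigablast.com with these terms, one row at a time. One search would be trump clinton. The result is a list of URLs. I call requests.get(url) on each one and process those URLs, dropping timeouts and responses with status_code = 400s, to build a clean list, clean_urls = []. After that, I run the following code: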

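import requests
from bs4 import BeautifulSoup

# clean_urls and my_params are built earlier in the script (not shown here)
count = 1
for link in clean_urls[:2]:
    page = requests.get(link, timeout=5)
    try:
        # .text returns all text nodes, including the contents of <script> and <style>
        page = BeautifulSoup(page.content, 'html.parser').text
        # the with block closes each output file as soon as it has been written
        with open(my_params['q'] + '_' + str(count) + '.txt', 'w') as webpage_out:
            webpage_out.write(page)
        count += 1
    except Exception:
        pass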

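At the line page = BeautifulSoup(page.content, 'html.parser').text, I have the text of the entire web page, including the embedded styles and scripts. I cannot use BeautifulSoup to locate them at that point because the tags no longer exist. I did try page = BeautifulSoup(page.content, 'html.parser') followed by find_all('<script>') to get rid of the scripts, but I ended up erasing the entire file. The desired result would be all the text of the HTML without any...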

body {
    font: something;
}

or any JavaScript...

$(document).ready(function(){
    $some code
)};

The final file should contain no code, only the content of the document.

Answer 1

I used this code to get rid of the JavaScript and CSS code when cleaning an HTML page:

WHERE FOLDER_NAME= %s or FILE_NAME = %s

BeautifulSoup can get rid of the embedded JS and CSS in the HTML.
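
For illustration, here is a minimal sketch of one common way to do this with BeautifulSoup; it is not necessarily the code this answer originally used. The helper name visible_text and the sample HTML are assumptions made up for the example. The idea is to remove the script and style tags from the parsed tree before extracting the text, and to pass find_all bare tag names such as 'script' rather than markup such as '<script>'.

```python
from bs4 import BeautifulSoup


def visible_text(html):
    """Return the text of an HTML document without <script>/<style> contents.

    visible_text is a hypothetical helper name used only for this sketch.
    """
    soup = BeautifulSoup(html, "html.parser")
    # find_all expects bare tag names ("script"), not markup like "<script>"
    for tag in soup.find_all(["script", "style"]):
        tag.decompose()  # remove the tag and everything inside it
    # Strip each line and drop the empty ones left behind by the markup
    lines = (line.strip() for line in soup.get_text(separator="\n").splitlines())
    return "\n".join(line for line in lines if line)


# Made-up sample page that mirrors the CSS and JS from the question
sample = """<html><head><style>body { font: something; }</style>
<script>$(document).ready(function(){});</script></head>
<body><h1>Title</h1><p>Only this text should remain.</p></body></html>"""

print(visible_text(sample))
```

In the loop from the question, the line that calls .text could then become page = visible_text(page.content), assuming the rest of the script stays the same.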
