Convert HTML to TXT

Time: 2019-01-31 14:39:46

Tags: python html python-requests

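I am trying to convert an HTML page to text and store it in a file. The conversion works, but the file contains some stray backslashes and asterisks.

This is the code I am using: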

import html2text 
from bs4 import BeautifulSoup
import requests as r 


url = r.get("https://dev.bizlem.io:8082/scorpio1/HANDY_AND_MR_FUEL_OIL_POSITIONS_BASIS_MALTA_AS_OF_TUESDAY_23RD_OCTOBER_2018_1.html")

# print(html2text.html2text(url.text))
web_text = url.text
file = open('text', 'w+')
file.write(html2text.html2text(web_text.replace("** \----", "")))
file.close()

This is the output I get:

HANDY_AND_MR_FUEL_OIL_POSITIONS_BASIS_MALTA_AS_OF_TUESDAY_23RD_OCTOBER_2018

FROM: JONNY HAMMOND / AFFINITY TANKERS



HANDY & MR FUEL OIL POSITIONS BASIS MALTA, AS OF TUESDAY, 23RD OCTOBER 2018

===========================================================================



DATE  VESSEL           DWT YR PORT           OPEN  FLEET       COMMENT  

\----  \------           \--- -- ----           \----  \-----       \-------  

23/10 **KRISJANIS VALDEMA 37 07 MALTA           23/10 LATVIAN     SUBS**  

Expected format:

HANDY_AND_MR_FUEL_OIL_POSITIONS_BASIS_MALTA_AS_OF_TUESDAY_23RD_OCTOBER_2018

FROM: JONNY HAMMOND / AFFINITY TANKERS



HANDY & MR FUEL OIL POSITIONS BASIS MALTA, AS OF TUESDAY, 23RD OCTOBER 2018

===========================================================================



DATE  VESSEL           DWT YR PORT           OPEN  FLEET       COMMENT       

----  ------           --- -- ----           ----  -----       -------       

23/10 KRISJANIS VALDEMA 37 07 MALTA          23/10 LATVIAN     SUBS  

2 Answers:

Answer 0 (score: 1)

You can use replace to remove the unwanted symbols:

from html2text import html2text
import requests as r

html = r.get("https://dev.bizlem.io:8082/scorpio1/HANDY_AND_MR_FUEL_OIL_POSITIONS_BASIS_MALTA_AS_OF_TUESDAY_23RD_OCTOBER_2018_1.html").text
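# '\\-' is a literal backslash followed by a hyphen (the unescaped '\-' is an invalid escape sequence).
text = html2text(html).replace('*', '').replace('\\-', '')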
with open('text.txt', 'w') as f:
    f.write(text)

The output is:

HANDY_AND_MR_FUEL_OIL_POSITIONS_BASIS_MALTA_AS_OF_TUESDAY_23RD_OCTOBER_2018

FROM: JONNY HAMMOND / AFFINITY TANKERS



HANDY & MR FUEL OIL POSITIONS BASIS MALTA, AS OF TUESDAY, 23RD OCTOBER 2018

===========================================================================



DATE  VESSEL           DWT YR PORT           OPEN  FLEET       COMMENT


---  -----           -- -- ----           ---  ----       ------  

23/10 KRISJANIS VALDEMA 37 07 MALTA           23/10 LATVIAN     SUBS  



25/10 SEAVALOUR          47 07 GREECE         23/10 THENAMARIS  SUBS
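
Note that replacing the backslash-hyphen pair with an empty string also drops the hyphen itself, which is why each dashed rule in the output above is one character shorter than in the expected format. The following is a minimal sketch, assuming the only unwanted markup is the `**` bold markers and the backslash that precedes some hyphens in the converted text. The helper name `strip_markdown_noise` and the sample string are hypothetical.

```python
import re


def strip_markdown_noise(text: str) -> str:
    """Remove '**' bold markers and unescape backslash-hyphen pairs."""
    # Drop Markdown bold markers only; single asterisks are left alone.
    text = text.replace("**", "")
    # Turn an escaped hyphen back into a plain hyphen, keeping the hyphen.
    return re.sub(r"\\-", "-", text)


# Hypothetical sample mimicking the converted text from the question.
sample = "\\----  \\------\n23/10 **KRISJANIS VALDEMA 37 07 MALTA**"
print(strip_markdown_noise(sample))
```

Keeping the hyphen preserves the column widths of the table, so the dashed rule stays aligned with the header row.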

Answer 1 (score: 0)

If you do not need BeautifulSoup, you can render the page with the html2text library alone. In my opinion, it is the more reliable way to convert HTML to text.

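Edit: code fixed to use the requests library:

import html2text
import requests

# Fetch the page over HTTP; the built-in open() cannot read a URL.
htmlForRender = requests.get("https://dev.bizlem.io:8082/scorpio1/HANDY_AND_MR_FUEL_OIL_POSITIONS_BASIS_MALTA_AS_OF_TUESDAY_23RD_OCTOBER_2018_1.html").text

print(html2text.html2text(htmlForRender))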
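
If the Markdown markup is not wanted at all, another option is to extract the plain text directly instead of converting to Markdown and cleaning up afterwards. The following is a minimal sketch that uses only the standard library's `html.parser` rather than html2text. The class name `TextExtractor`, the helper `html_to_plain_text`, and the sample HTML are hypothetical, and the structure of the real page is an assumption.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text exactly as it appears, including column spacing.
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        return "".join(self._chunks)


def html_to_plain_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return parser.text()


# Hypothetical sample; the real page would be fetched with requests.
sample_html = "<pre>DATE  VESSEL\n----  ------\n23/10 <b>KRISJANIS VALDEMA</b></pre>"
print(html_to_plain_text(sample_html))
```

Because no Markdown is generated, bold text produces no asterisks and hyphens are never escaped. The trade-off is that nothing is done for links, lists, or tables.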