Question

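I am trying to write a script that fetches a website's source, saves all of the HTML to a file, and then extracts some information from it.

So far I have finished the first part: all of the HTML is saved to a text file.

Now I have to extract the relevant information and save it to another text file.

But I am running into encoding problems, and I am also not sure how to extract text in Python.

Parsing the website: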

import urllib.request

File name used to store the data:

file_name = r'D:\scripts\datos.txt'

I want to get the text that comes after this tag and before the other one:

tag_starts_with = '<p class="item-description">'
tag_ends_with = '</p>'

I fetch the website's source and save it to a text file:

with urllib.request.urlopen("http://www.website.com/") as response, open(file_name, 'wb') as out_file:
    data = response.read() 
    out_file.write(data)

print(out_file)  # First question: how can I print the file? It gives me an error, I can't print bytes

The file is now full of HTML text, so I want to open it and process it:

file_for_results = open(r'D:\scripts\datos.txt',encoding="utf8")

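Extracting the information from the file:

Second question: how do I take a substring of the line that contains the tag and get the text between <p class="item-description"> and </p>, so that I can store it in file_for_results?

This is the pseudocode that I am unable to turn into real code: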

for line in file_to_filter:
    if line contains word_starts_with
      copy in file_for_results until you find </p>

Thanks in advance for your help.

Answer 1

I assume this is some kind of assignment where you have to parse the HTML with the given algorithm; if not, just use Beautiful Soup.
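If installing Beautiful Soup is not an option, the standard library's html.parser module can do a similar job with a real parser instead of line matching. What follows is only a minimal sketch: it assumes the class attribute is exactly item-description, and ItemDescriptionParser and extract_descriptions are names made up for this example, not part of any library.

```python
from html.parser import HTMLParser


class ItemDescriptionParser(HTMLParser):
    """Collects the text of every <p class="item-description"> element."""

    def __init__(self):
        super().__init__()
        self.descriptions = []  # finished paragraphs
        self._inside = False    # True while inside a matching <p>
        self._chunks = []       # text pieces of the current paragraph

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples
        if tag == "p" and ("class", "item-description") in attrs:
            self._inside = True
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "p" and self._inside:
            self.descriptions.append("".join(self._chunks).strip())
            self._inside = False

    def handle_data(self, data):
        # Text inside nested tags such as <b> arrives here without the tags
        if self._inside:
            self._chunks.append(data)


def extract_descriptions(html_text):
    """Hypothetical helper: return the text of all matching paragraphs."""
    parser = ItemDescriptionParser()
    parser.feed(html_text)
    parser.close()
    return parser.descriptions


sample = ('<div><p class="item-description">First <b>item</b></p>'
          '<p>skip</p><p class="item-description">Second</p></div>')
print(extract_descriptions(sample))  # ['First item', 'Second']
```

Because the parser tracks elements rather than lines, this also works when a paragraph spans several lines or when several paragraphs share one line.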

The pseudocode actually translates into Python quite easily:

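word_starts_with = '<p class="item-description">'
word_ends_with = '</p>'

file_to_filter = open("file.html", 'r', encoding="utf8")
out_file = open("text_output", 'w', encoding="utf8")
copying = False  # becomes True once the opening tag has been seen
for line in file_to_filter:
    if word_starts_with in line:
        copying = True
    if copying:
        print(line, end='', file=out_file)  # Store data in another file
        if word_ends_with in line:
            break  # stop after the first matching paragraph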

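Of course, you still need to close the files, make sure the tags are stripped, and so on, but this is roughly what your code should look like with this algorithm.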
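As for the first question: the data read from the response (or from a file opened in 'rb' mode) is bytes, so it has to be decoded into str before it can be printed or searched as text, for example with print(data.decode("utf-8")). The sketch below puts the pieces together: it reads the saved bytes, decodes them, and collects the text between every pair of tags with the tags themselves stripped. It assumes the page is UTF-8 encoded; extract_between and filter_file are hypothetical helper names chosen for this example.

```python
def extract_between(text, start_tag, end_tag):
    """Return the stripped text found between each start_tag/end_tag pair."""
    results = []
    pos = 0
    while True:
        start = text.find(start_tag, pos)
        if start == -1:
            break
        start += len(start_tag)          # skip past the opening tag
        end = text.find(end_tag, start)
        if end == -1:
            break                        # unclosed tag: stop searching
        results.append(text[start:end].strip())
        pos = end + len(end_tag)
    return results


def filter_file(source_path, result_path, start_tag, end_tag):
    """Read saved HTML bytes, extract the tagged text, write one item per line."""
    # The page was saved with open(..., 'wb'), so read it back as bytes
    with open(source_path, "rb") as source:
        raw = source.read()
    # Assumption: the site is UTF-8; errors="replace" avoids a crash on odd bytes
    html_text = raw.decode("utf-8", errors="replace")
    found = extract_between(html_text, start_tag, end_tag)
    with open(result_path, "w", encoding="utf-8") as out_file:
        for item in found:
            out_file.write(item + "\n")
    return found


sample = (b'<p class="item-description">caf\xc3\xa9</p>\n'
          b'<p class="item-description">\n  two lines\n</p>')
print(extract_between(sample.decode("utf-8"),
                      '<p class="item-description">', '</p>'))
# ['café', 'two lines']
```

Searching the whole decoded string rather than going line by line means a paragraph that spans several lines is still captured, and the with blocks close the files automatically.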

How to extract text from a bytes file using Python
