Question

I am trying to build a data corpus from a set of .html pages stored in a directory.

These HTML pages contain a lot of information that I don't need.

All of this unneeded information comes before the line

<div class="channel">

How can I programmatically remove all the text before

<div class="channel">

in every HTML file in a folder?

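Bonus question, for a 50-point bounty:

How can I also programmatically remove everything from a given element onward, for example from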

<div class="footer">

So, if my index.html looked like this before:

<head>
   <title>This is bad HTML</title>
</head>
<body>
  <h1> Remove me</h1>
  <div class="channel">
    <h1> This is the good data, keep me</h1>

    <p> Keep this text </p>

  </div>
  <div class="footer">
    <h1> Remove me, I am pointless</h1>
  </div>
</body>

then after my script has run, I want it to be:

  <div class="channel">
    <h1> This is the good data, keep me</h1>

    <p> Keep this text </p>

  </div>

Answer 1

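This is a bit heavy on memory, but it works. Basically, you list the directory, pick out all the ".html" files, read each one into a variable, find the split point, keep the part before or after it in a variable, and then overwrite the file.

There are probably better ways to do this, but it works.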

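import os

# Collect the names of all .html files in the current directory.
htmlFiles = [name for name in os.listdir(".") if name.endswith(".html")]

for fileName in htmlFiles:
    # Read the whole file into memory.
    with open(fileName) as f:
        content = f.read()

    loc = content.find('<div class="channel">')
    if loc == -1:
        # find() returns -1 when the marker is absent; skip the file
        # rather than truncating it to its last character.
        continue

    # Keep everything from the marker onward and overwrite the file.
    newContent = content[loc:]

    with open(fileName, 'w') as f:
        f.write(newContent)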

If you want to keep only what comes before a marker instead:

newContent = content[:loc]  # the slice end is exclusive, so no -1 is needed

Note that the string you search for should be kept in a variable rather than hard-coded.
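Following that note, here is a minimal sketch that takes both markers as parameters and also covers the bonus question. The function name `trim` and its parameters are hypothetical, chosen only for illustration; it assumes a plain substring match is good enough for your pages.

```python
def trim(content, start_marker, end_marker=None):
    """Keep content from start_marker up to (not including) end_marker."""
    start = content.find(start_marker)
    if start == -1:
        # Start marker missing: return the text unchanged.
        return content

    end = len(content)
    if end_marker is not None:
        # Only look for the end marker after the start marker.
        found = content.find(end_marker, start)
        if found != -1:
            end = found

    return content[start:end]
```

Calling `trim(content, '<div class="channel">', '<div class="footer">')` would keep the channel div and drop the footer and everything after it. Any whitespace between the two elements is kept, so you may want to call `.rstrip()` on the result.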

Also, this does not work recursively over a file/folder structure, but you can work out how to modify it to do that quite easily.
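One possible way to make it recursive is `pathlib.Path.rglob`, which walks subdirectories for you. This is only a sketch: the function name `trim_html_tree` is hypothetical, and it assumes the files are UTF-8 encoded, which may not hold for your corpus.

```python
from pathlib import Path


def trim_html_tree(root, marker):
    """Trim every .html file under root so it starts at marker.

    Returns the number of files that were rewritten.
    """
    changed = 0
    for path in Path(root).rglob("*.html"):
        # Assumption: the files are UTF-8 encoded.
        content = path.read_text(encoding="utf-8")
        loc = content.find(marker)
        if loc <= 0:
            # -1: marker missing; 0: file already starts at the marker.
            continue
        path.write_text(content[loc:], encoding="utf-8")
        changed += 1
    return changed
```

For example, `trim_html_tree(".", '<div class="channel">')` would process the current directory and everything below it. Since files are overwritten in place, it is worth trying it on a copy of the data first.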

Answer 2

Removing everything above it and everything below it means the only thing left is this part:

<div class="channel">
    <h1> This is the good data, keep me</h1>
    <p> Keep this text </p>
</div>

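Rather than thinking about removing what you don't want, it would be easier to simply extract what you do want. You can use an XML parser (such as DOM) to easily extract the channel div.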
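A rough sketch of the extraction idea in Python follows. Note that it swaps in the standard library's event-based `html.parser` rather than a DOM-style XML parser, because the sample page has no single root element and so is not well-formed XML. The class and function names (`DivExtractor`, `extract_div`) are hypothetical, and comments inside the div are dropped.

```python
from html.parser import HTMLParser


class DivExtractor(HTMLParser):
    """Collect the source of the first <div> carrying a given class."""

    def __init__(self, class_name):
        # Keep entities such as &amp; as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.class_name = class_name
        self.depth = 0      # nesting depth of <div> inside the target
        self.done = False   # True once the target div has been closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth == 0:
            classes = (dict(attrs).get("class") or "").split()
            if tag == "div" and self.class_name in classes:
                self.depth = 1
                self.parts.append(self.get_starttag_text())
            return
        if tag == "div":
            self.depth += 1
        self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied as written.
        if self.depth > 0:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.depth == 0:
            return
        self.parts.append(f"</{tag}>")
        if tag == "div":
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth > 0:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.depth > 0:
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if self.depth > 0:
            self.parts.append(f"&#{name};")


def extract_div(html, class_name):
    """Return the first div with class_name, or '' if there is none."""
    parser = DivExtractor(class_name)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)
```

With this, `extract_div(content, "channel")` would return just the channel div, which handles the main question and the footer bonus in one step. A third-party parser may well be more robust on messy real-world pages.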

Answer 3

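You didn't mention a language in the question. The post is tagged python, so this answer may be off the mark, but I'll offer a PHP solution that should be easy to rewrite in another language.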

$html='....'; // your page
$search='<div class="channel">';
$components = explode($search,$html); // [0 => before the string, 1 => after the string]
$result = $search.$components[1];
return $result;

The reverse is just as easy: change $search to <div class="footer"> and take the value of $components[0].

If the $search string happens to occur more than once:

$html='....'; // your page
$search='<div class="channel">';
$components = explode($search,$html); // [0 => before the string, 1 => after the string]
unset($components[0]);
$result = $search.implode($search,$components);
return $result;

Anyone who knows Python better than I do is welcome to rewrite this and take the accepted answer!
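A possible Python rewrite of the PHP above, offered as a sketch rather than a definitive port. It uses `str.partition`, which splits on the first occurrence only, so it also covers the case where the search string appears several times. The function names `keep_from` and `keep_before` are made up for illustration, and returning the input unchanged when the string is missing is a choice made here, not something the PHP version does.

```python
def keep_from(html, search):
    """Return html starting at the first occurrence of search."""
    before, found, after = html.partition(search)
    # partition() yields an empty separator when search is absent.
    return found + after if found else html


def keep_before(html, search):
    """Return the part of html before the first occurrence of search."""
    before, found, after = html.partition(search)
    # When search is absent, before is the whole input.
    return before
```

Chaining the two, `keep_before(keep_from(html, '<div class="channel">'), '<div class="footer">')` would leave only the channel div plus any trailing whitespace.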

Programmatically remove everything before an HTML node?
