Question

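I am converting an ePub into a single HTML file, so I need to join the individual chapters into one HTML file. The files are named "..._split_000.html" and so on, and I have set up various structures to iterate over the ToC, generate directory names, etc.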

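I want to concatenate the HTML content of the parts with Beautifulsoup by appending the contents of the body elements of the later parts to the body of the first part. Only my code doesn't seem to work. "book" is an instance of ebooklib's epub class. "docsfiles" is a dictionary keyed by the name of the HTML file, whose values hold the list of files that belong to it: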

def concat_articles(book, docsfiles, toc):
    articles = {}
    for doc, val in docsfiles.iteritems():
        firstsoup = False
        for f in val['files']:
            content = book.get_item_with_href(f).content
            soup = BeautifulSoup(content, "html.parser")
            if not firstsoup:
                firstsoup = soup
                continue
            body = copy.copy(soup.body)
            firstsoup.body.append(body)
        articles[val['id']] = firstsoup.prettify("utf-8")
    return articles

When I run this on my ePub, the following error occurs:

Traceback (most recent call last):
  File "extract-new.py", line 170, in <module>
    articles_html = concat_articles(book, docsfiles, toc)
  File "extract-new.py", line 97, in concat_articles
    firstsoup.body.append(body)
  File "/Library/Python/2.7/site-packages/bs4/element.py", line 338, in append
    self.insert(len(self.contents), tag)
  File "/Library/Python/2.7/site-packages/bs4/element.py", line 291, in insert
    new_child.extract()
  File "/Library/Python/2.7/site-packages/bs4/element.py", line 235, in extract
    del self.parent.contents[self.parent.index(self)]
  File "/Library/Python/2.7/site-packages/bs4/element.py", line 888, in index
    raise ValueError("Tag.index: element not in tag")
ValueError: Tag.index: element not in tag

Actually, I should also unwrap() the soup.body in the code above, but that leads to a different error, so I thought I would solve this one first.

Answer 1

It works when I use Martijn Pieters' "clone()" method from this StackOverflow post:

 body = clone(soup.body)
 firstsoup.body.append(body)

Why this works and "copy.copy()" does not, I have not figured out yet.
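For readers without the linked post at hand, here is a minimal sketch of what such a clone() helper can look like. It is an illustrative re-implementation written for Python 3 and a recent bs4 4.x, not the exact function from that post. A plausible reason why copy.copy() fails on the old bs4 shown in the traceback (an assumption, not confirmed in this thread): a shallow copy keeps the original's parent pointer, but the parent's contents list only holds the original object, so the identity lookup in Tag.index raises ValueError when append() tries to extract() the copy. A node rebuilt from scratch has no parent, so nothing has to be extracted.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag


def clone(el):
    """Return a detached deep copy of a bs4 node (illustrative sketch)."""
    if isinstance(el, NavigableString):
        # Rebuild the string with its own subclass (Comment, CData, ...).
        return type(el)(el)
    new_tag = Tag(name=el.name, namespace=el.namespace, prefix=el.prefix,
                  attrs=dict(el.attrs))
    # Carry over rendering flags, e.g. so a void element such as <br/> stays void.
    new_tag.can_be_empty_element = el.can_be_empty_element
    new_tag.hidden = el.hidden
    for child in el.contents:
        new_tag.append(clone(child))
    return new_tag


# Made-up sample documents, only for demonstration.
first = BeautifulSoup("<html><body><p>one</p></body></html>", "html.parser")
second = BeautifulSoup("<html><body><p class='x'>two<br/></p></body></html>",
                       "html.parser")

# The clones are new objects, so second.body.contents is not modified here.
for child in second.body.contents:
    first.body.append(clone(child))

print(first.body)
```

The second document is left untouched, because only the freshly built copies are inserted into the first one.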

A complete working solution that does not duplicate the body tag looks like this:

    body = clone(soup.body)
    for child in body.contents:
        firstsoup.body.append(clone(child))

This also works when I use "copy.copy()" in the first line, but not when I replace "clone()" by "copy.copy()" in the last line.
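If the later parts are not needed after merging, a different technique avoids copying altogether: move the real child nodes instead of cloning them. The sketch below is an alternative, not the approach used above. The helper name merge_bodies and the sample markup are made up for illustration, and it assumes Python 3 with a recent bs4 and that every part has a body element.

```python
from bs4 import BeautifulSoup


def merge_bodies(html_parts):
    """Move the body children of every later part into the first part's body."""
    soups = [BeautifulSoup(part, "html.parser") for part in html_parts]
    first = soups[0]
    for soup in soups[1:]:
        # Snapshot the child list: append() detaches each node from its old
        # parent, which would otherwise change the list while iterating over it.
        for child in list(soup.body.contents):
            first.body.append(child)
    return first


parts = [
    "<html><body><h1>Chapter 1</h1></body></html>",
    "<html><body><h1>Chapter 2</h1><p>text</p></body></html>",
]
merged = merge_bodies(parts)
print(merged.body)
```

The trade-off is that the later soups are emptied in the process, which is usually fine when they are thrown away after the merge anyway.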

Answer 2

This may be too late, but I ran into a similar problem and found a simpler solution. Use the str() function to convert all objects extracted with BeautifulSoup to strings.
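One way to read this suggestion, sketched under assumptions (Python 3, a recent bs4, made-up sample markup): serialize the children of the later body with str(), then parse that markup again so the resulting nodes have no ties to the tree they came from.

```python
from bs4 import BeautifulSoup

first = BeautifulSoup("<html><body><p>one</p></body></html>", "html.parser")
second = BeautifulSoup("<html><body><p>two</p><p>three</p></body></html>",
                       "html.parser")

# Serialize the children of the second body to plain markup.
inner_html = "".join(str(child) for child in second.body.contents)

# Parse that markup into brand-new nodes that are independent of `second`.
fragment = BeautifulSoup(inner_html, "html.parser")
for node in list(fragment.contents):
    first.body.append(node)

print(first.body)
```

The extra serialize-and-parse round trip costs some time on large chapters, but it sidesteps the question of how nodes are copied.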

Beautifulsoup: ValueError: Tag.index: element not in tag
