Python iteration error - the code in my BeautifulSoup script is incorrect, but why?

Time: 2014-12-05 14:32:34

Tags: python html beautifulsoup

import os
from bs4 import BeautifulSoup

do = dir_with_original_files = r'C:\FOLDER'      # raw strings keep the backslashes literal
dm = dir_with_modified_files = r'C:\NEW_FOLDER'

for root, dirs, files in os.walk(do):
    for f in files:
        print f.title()
        if f.endswith('~'): #you don't want to process backups
            continue
        original_file = os.path.join(root, f)
        mf = f.split('.')
        mf = ''.join(mf[:-1])+'_mod.'+mf[-1] # you can keep the same name 
                                             # if you omit the last two lines.
                                             # They are in separate directories
                                             # anyway. In that case, mf = f
        modified_file = os.path.join(dm, mf)
        with open(original_file, 'r') as orig_f, \
             open(modified_file, 'w') as modi_f:
            soup = BeautifulSoup(orig_f.read())


            for t in soup.find_all('td', class_='Test'):
                for child in t.find_all("font"):
                    child.string.wrap(soup.new_tag('h2'))

            # This is where you create your new modified file.
            modi_f.write(soup.prettify().encode(soup.original_encoding)) 

I have some HTML of this form:

<td class="Test">
   <a href="www.randomsite.com"> </a>
   <font>Text</font>
</td>

In my BeautifulSoup script, I am trying to change it to:

<td class="Test">
   <a href="www.randomsite.com"> </a>
   <font><h2>Text</h2></font>
</td>

Note that <h2> tags have been added around the text inside the font element.

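However, for each file it seems to create two extra files, neither of which is tagged the way I want. For example, if one of my files is file1.html, this code creates file1_mod.html and file1_mod_mod.html.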
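To see how a name like file1_mod_mod.html could come about, here is a minimal sketch that isolates the renaming step from the script. The helper name `modified_name` is hypothetical and the sketch is written for Python 3. It only shows what happens if a file that was already renamed goes through the same step again. Whether that is what actually happens here (for instance because the script was run over a folder that already contained `_mod` files) is an assumption the post does not confirm.

```python
def modified_name(filename):
    """Mirror the script's renaming step: put '_mod' before the extension."""
    parts = filename.split('.')
    return ''.join(parts[:-1]) + '_mod.' + parts[-1]


# First pass over an original file, then a second pass over that output.
first_pass = modified_name('file1.html')
second_pass = modified_name(first_pass)

print(first_pass)
print(second_pass)
```

If the second name matches the unexpected file on disk, that would suggest the loop is picking up its own output rather than only the original files.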

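What could be causing this? I have tried stepping through the file in PyCharm, but I am still relatively new to BeautifulSoup and Python, so debugging is a painful process.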
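One thing that might be worth ruling out while debugging is whether `os.walk` ever visits files the script itself wrote. The sketch below is not a confirmed diagnosis; it assumes the output folder could sit inside the walked tree, or that `_mod` files from an earlier run are still lying around. The function name `iter_original_files` and the `suffix` parameter are hypothetical, and the code is written for Python 3.

```python
import os


def iter_original_files(source_dir, output_dir, suffix='_mod'):
    """Yield file paths under source_dir, skipping output_dir and earlier outputs."""
    output_dir = os.path.abspath(output_dir)
    for root, dirs, files in os.walk(source_dir):
        # Prune the output directory in place so os.walk never descends into it.
        dirs[:] = [d for d in dirs
                   if os.path.abspath(os.path.join(root, d)) != output_dir]
        for name in files:
            stem, _ext = os.path.splitext(name)
            if name.endswith('~') or stem.endswith(suffix):
                continue  # backup file, or output left over from a previous run
            yield os.path.join(root, name)
```

Printing the paths this generator yields, next to the paths the original loop visits, should show quickly whether any `_mod` files are being processed a second time.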

0 answers:

No answers