Question

I have a task to extract the column data of tables from a .docx file into an .xls or .csv file using Python. The tables look like this:

Table 4-1. Bite_main.c

CHECK             Function      Line    Colum       Detail
=======================================================
overflow.2        xxxxxxx       xxx     xxx         xxx
overflow.5        xxxxxxx       xxx     xxx         xxx
overflow.8        xxxxxxx       xxx     xxx         xxx
overflow.12       xxxxxxx       xxx     xxx         xxx

Table 4-2. Bite_Engine.c

CHECK           Function    Line    Colum   Detail
overflow.4      xxxxxxx     xxx     xxx     xxx
overflow.9      xxxxxxx     xxx     xxx     xxx
overflow.8      xxxxxxx     xxx     xxx     xxx
overflow.10     xxxxxxx     xxx     xxx     xxx

Initially I used the "mammoth" library to convert the .docx file to an .html file (on the many sites I checked, everyone converts .docx to HTML to make the data easier to process).

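Now I need to extract only the "CHECK" column from the converted HTML file, once per table name (e.g. Table 4-1. Bite_main.c), into an .xls or .csv sheet. In the xls sheet it should look like this: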

1. Bite_main.c      overflow.2,overflow.5,overflow.8,overflow.12
2. Bite_Engine.c    overflow.4,overflow.9,overflow.8,overflow.10
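
The target layout above can be sketched with the standard library's `csv` module. This is a minimal sketch, not a definitive implementation: the function name `write_check_rows` and the `checks_by_file` dictionary are hypothetical, and it assumes the CHECK values have already been collected per file name.

```python
import csv
import io

def write_check_rows(checks_by_file, out):
    """Write one row per source file: index, file name, comma-joined CHECK values."""
    writer = csv.writer(out)
    for index, (file_name, checks) in enumerate(checks_by_file.items(), start=1):
        writer.writerow([index, file_name, ",".join(checks)])

# Hypothetical data mirroring the two tables in the question.
checks_by_file = {
    "Bite_main.c": ["overflow.2", "overflow.5", "overflow.8", "overflow.12"],
    "Bite_Engine.c": ["overflow.4", "overflow.9", "overflow.8", "overflow.10"],
}

# Written to an in-memory buffer here; a real file works the same way.
buffer = io.StringIO()
write_check_rows(checks_by_file, buffer)
csv_text = buffer.getvalue()
```

Because the joined CHECK list itself contains commas, `csv.writer` quotes that field, so spreadsheet programs read it as a single cell. When writing to a real file, the `csv` documentation recommends opening it with `newline=""`.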

---

I used the code below to convert to HTML:

import mammoth

# Raw strings keep the backslash in the paths literal.
# The output is opened as UTF-8 text so the HTML string can be written directly.
with open(r"\input.docx", "rb") as docx_file, open(r"\out_file.html", "w", encoding="utf-8") as myfile:
    result = mammoth.convert_to_html(docx_file, include_default_style_map=False)
    html = result.value
    myfile.write(html)  # one issue here: all the file data ends up on a single line of the HTML file

After the conversion, I tried to extract the tables, but I could not work out how to proceed:

from bs4 import BeautifulSoup

with open(r"\out_file.html", "r", encoding="utf-8") as html_file:
    raw_html = html_file.read()

soup = BeautifulSoup(raw_html, "html.parser")
tables = soup.find_all("table")
table_list = []
for table in tables:
    table_dict = {}
    rows = table.find_all("tr")
    count = 0
    for row in rows:
        value_list = []
        entries = row.find_all("td")
        # this is as far as I got

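I cannot work out how to pick up the data when I reach "Table 4-1. Bite_main.c" and then extract its "CHECK" column into a separate column of a new xls sheet. I need to repeat the same thing for every "Table 4.x. XXX.X".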

I am new to Python. Please suggest the logic for implementing this, or a better way to approach the problem. Thanks in advance to anyone who responds.
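
One possible shape for the logic asked about here is to remember the last paragraph seen before each table and treat it as that table's caption. The sketch below uses only the standard library's `html.parser` so that it runs without extra packages; the same traversal could be written with BeautifulSoup. It rests on assumptions that have not been checked against the real file: that the converted HTML places each caption in a `<p>` directly before its `<table>`, that the first row of each table is the header and contains a cell reading exactly `CHECK`, and that tables are not nested. `CheckColumnParser`, `file_name_from_caption`, and `sample_html` are hypothetical names introduced for illustration.

```python
import re
from html.parser import HTMLParser

class CheckColumnParser(HTMLParser):
    """Collect the CHECK column of every table, keyed by the paragraph before it."""

    def __init__(self):
        super().__init__()
        self.checks_by_caption = {}  # caption text -> list of CHECK values
        self._last_paragraph = ""    # most recent <p> text seen outside a table
        self._in_table = False
        self._text = []              # text fragments of the current <p> or cell
        self._row = []               # cell texts of the current row
        self._rows = []              # rows of the current table

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._in_table, self._rows = True, []
        elif tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._text = []
        elif tag == "p" and not self._in_table:
            self._text = []

    def handle_data(self, data):
        self._text.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._row.append("".join(self._text).strip())
        elif tag == "tr":
            self._rows.append(self._row)
        elif tag == "p" and not self._in_table:
            self._last_paragraph = "".join(self._text).strip()
        elif tag == "table":
            self._in_table = False
            self._store_table()

    def _store_table(self):
        if not self._rows or "CHECK" not in self._rows[0]:
            return  # assumption: only tables with a CHECK header are of interest
        col = self._rows[0].index("CHECK")
        values = [row[col] for row in self._rows[1:] if len(row) > col and row[col]]
        self.checks_by_caption[self._last_paragraph] = values

def file_name_from_caption(caption):
    """Return the text after 'Table N-M.' (assumed caption shape), else the caption."""
    match = re.match(r"Table\s+\d+[-.]\d+\.?\s*(.+)", caption)
    return match.group(1).strip() if match else caption

# Hypothetical, shortened stand-in for the converted HTML.
sample_html = (
    "<p>Table 4-1. Bite_main.c</p>"
    "<table>"
    "<tr><td><p>CHECK</p></td><td><p>Function</p></td></tr>"
    "<tr><td><p>overflow.2</p></td><td><p>xxxxxxx</p></td></tr>"
    "<tr><td><p>overflow.5</p></td><td><p>xxxxxxx</p></td></tr>"
    "</table>"
    "<p>Table 4-2. Bite_Engine.c</p>"
    "<table>"
    "<tr><td><p>CHECK</p></td><td><p>Function</p></td></tr>"
    "<tr><td><p>overflow.4</p></td><td><p>xxxxxxx</p></td></tr>"
    "</table>"
)

parser = CheckColumnParser()
parser.feed(sample_html)
parser.close()
checks_by_file = {
    file_name_from_caption(caption): values
    for caption, values in parser.checks_by_caption.items()
}
```

Looking up the `CHECK` column by header text rather than by fixed position keeps the sketch working if the column order differs between tables. If the real captions do not match the assumed `Table N-M. name` shape, the whole caption is used as the key instead.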

---

0 Answers: