Parsing a table from a .docx file

Date: 2015-01-09 13:32:33

Tags: python xml parsing docx python-docx

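I want to use Python and python-docx to parse a table from a .docx file into some useful data structure.

In my case the .docx file contains only one table. I uploaded it so you can have a look. Here is a screenshot: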

Books.docx

1 answer:

Answer 0 (score: 15)

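You can use the snippet below to parse the document into a list in which each row is a dictionary mapping the table's header values to that row's column values.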

from docx import Document

# Load the first table from your document. In your example file,
# there is only one table, so I just grab the first one.
document = Document('Books.docx')
table = document.tables[0]

# Data will be a list of rows represented as dictionaries
# containing each row's data.
data = []

keys = None
for i, row in enumerate(table.rows):
    text = (cell.text for cell in row.cells)

    # Establish the mapping based on the first row
    # headers; these will become the keys of our dictionary
    if i == 0:
        keys = tuple(text)
        continue

    # Construct a dictionary for this row, mapping
    # keys to values for this row
    row_data = dict(zip(keys, text))
    data.append(row_data)

This gives you:

data = [
  {u'Pub.': u'Penguin Books',
   u'Auther': u'Edward de BONO',
   u'Sr. No.': u'1',
   u'Name of Book': u'Six Thinking Hats'
  },
  ...
]
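
The same loop can be wrapped in a small helper that accepts any iterable of rows of cell text, which makes it easy to try without a .docx file at hand. This is only a sketch: the function name `rows_to_dicts` and the `sample` rows are hypothetical stand-ins for the real table, not part of python-docx.

```python
def rows_to_dicts(rows):
    """Map each data row to a dict keyed by the header (first) row."""
    rows = iter(rows)
    try:
        # The first row supplies the dictionary keys.
        keys = tuple(next(rows))
    except StopIteration:
        return []  # empty table: no header and no data
    # Every remaining row is zipped against the header keys.
    return [dict(zip(keys, row)) for row in rows]


# Hypothetical sample standing in for the text of each table row.
sample = [
    ["Sr. No.", "Name of Book", "Auther", "Pub."],
    ["1", "Six Thinking Hats", "Edward de BONO", "Penguin Books"],
]
data = rows_to_dicts(sample)
```

With a real document, the rows could be supplied as `([cell.text for cell in row.cells] for row in table.rows)`. Note that if two header cells contain the same text, the later column overwrites the earlier one in each dictionary.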

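If you would rather have a tuple for each row, set row_data to a tuple of the text values instead of creating a dictionary. That is, in the loop, instead of constructing the dict, do: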

# Construct a tuple for this row
row_data = tuple(text)
data.append(row_data)

Now data will hold something like this:

data = [
  (u'1',
   u'Six Thinking Hats',
   u'Edward de BONO',
   u'Penguin Books'
  ),
 ...
]

In that case you can obviously skip building keys (but still skip the first row!).
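
Put together, the tuple variant might look like the sketch below. It drops the header row and keeps no keys. As before, `rows_to_tuples` and `sample` are hypothetical names used for illustration, and the rows are plain lists of strings rather than a real python-docx table.

```python
def rows_to_tuples(rows):
    """Return each data row as a tuple, dropping the header row."""
    rows = iter(rows)
    next(rows, None)  # discard the header row, if there is one
    return [tuple(row) for row in rows]


# Hypothetical sample standing in for the text of each table row.
sample = [
    ["Sr. No.", "Name of Book", "Auther", "Pub."],
    ["1", "Six Thinking Hats", "Edward de BONO", "Penguin Books"],
]
data = rows_to_tuples(sample)
```

Tuples keep the column order of the table, so values are looked up by position rather than by header name.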