Question

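Suppose there is a string named likes_and_dislikes that is visually formatted as the table shown below.

How can I parse the string and return a list of (like, dislike) tuples? The header row (likes, dislikes) must also be removed from the list of tuples.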

likes_and_dislikes="""

+------------------------------------+-----------------------------------+
| likes                              | dislikes                          |
+------------------------------------+-----------------------------------+
| Meritocracy                        | Favoritism, ass-kissing, politics |
+------------------------------------+-----------------------------------+
| Healthy debates and collaboration  | Ego-driven rhetoric, drama and FUD|
|                                    | to get one's way                  |
+------------------------------------+-----------------------------------+
| Autonomy given by confident leaders| Micro-management by insecure      |
| capable of attracting top-tier     | managers compensating for a weak, |
| talent                             | immature team                     |
+------------------------------------+-----------------------------------+  """

Answer 1

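The key here is to look closely at the table and understand what you want to extract.

First, a string like this is usually easier to parse line by line, so we split it by table row and then parse the columns within each row. We do this mainly because the likes and dislikes can span multiple lines.

1. Get each row

We don't know how wide the table is, so we use a regular expression to break it up: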

import re
pairs = re.split(r"\+-*\+-*\+\n?", likes_and_dislikes)[2:-1]  # Drop the header and the tail

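This gives us a list whose items correspond to the table's multi-line rows. The slice at the end drops the header as well as any trailing whitespace we don't want to process. However, we still have to pull together the strings that span multiple lines within a cell.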
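To make the shape of the split result concrete, here is a small sketch on a miniature table. The `tiny` string and its contents are made up for illustration; only the regular expression comes from the answer. The first chunk is whatever precedes the first border, the second is the header, and the last is whatever follows the final border, which is why the slice `[2:-1]` keeps only the data rows.

```python
import re

# Hypothetical miniature table, used only to show the shape of the split result.
tiny = (
    "\n"
    "+-----+-----+\n"
    "| a   | b   |\n"
    "+-----+-----+\n"
    "| one | two |\n"
    "|     | 2   |\n"
    "+-----+-----+\n"
)

border = r"\+-*\+-*\+\n?"         # a border line, optionally followed by a newline
chunks = re.split(border, tiny)   # leading text, header, data rows, trailing text
rows = chunks[2:-1]               # keep only the data rows

for row in rows:
    print(repr(row))
```

Each item in `rows` still contains every physical line of that table row, separated by newlines.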

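2. Find the likes and dislikes

If we iterate over this list of rows, we know that each row holds one like and one dislike, each spanning an unknown number of lines. We initialize like and dislike as lists so that the final join is fast.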

likes, dislikes = [], []  # final strings, one per table row
for p in pairs:
  like, dislike = [], []  # text fragments of the current row

3. Process each line

Taking our row, we split it on newlines and then split each line on the pipe character (|).

  for l in p.split('\n'):
    pair = l.split('|')

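4. Pull out each like and dislike

If the pair we get has more than one value, it must contain a like or a dislike to capture. So we append it to our like and dislike lists (not likes or dislikes, since those hold our final formatted strings). We should also strip these values to remove any leading or trailing whitespace.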

    if len(pair) > 1:
      # Not a blank line
      like.append(pair[1].strip())
      dislike.append(pair[2].strip())
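The indices 1 and 2 and the `len(pair) > 1` check follow from how `str.split` treats the outer pipes. A short sketch, using one line from the table (the variable names `cells`, `like_text`, `dislike_text` and `blank` are only for illustration):

```python
line = "| Meritocracy                        | Favoritism, ass-kissing, politics |"

# Splitting on '|' yields an empty string before the first pipe and another
# after the last one, so the two cells sit at indices 1 and 2.
cells = line.split('|')
like_text = cells[1].strip()
dislike_text = cells[2].strip()

# An empty line has no pipe at all, so the result has a single element.
blank = "".split('|')

print(len(cells), like_text, dislike_text, blank)
```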

5. Build the final text

Once we have finished processing the row, we can join the strings with a single space and finally add the results to the likes and dislikes lists.

  if len(like) > 0:
    likes.append(" ".join(like))
  if len(dislike) > 0:
    dislikes.append(" ".join(dislike))
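One side effect of joining with a space is that an empty continuation cell leaves a trailing space in the result, as in 'Healthy debates and collaboration '. If that matters, a possible variation is to skip empty fragments when joining. The helper name `join_fragments` is hypothetical and not part of the original answer:

```python
def join_fragments(fragments):
    """Join cell fragments with single spaces, skipping empty ones."""
    return " ".join(f for f in fragments if f)

# Fragments collected for a cell whose second line is blank.
like = ["Healthy debates and collaboration", ""]

plain = " ".join(like)        # keeps a trailing space
clean = join_fragments(like)  # no trailing space

print(repr(plain))
print(repr(clean))
```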

6. Use our new data structures

Now we can do whatever we like with these two new lists: either print each list separately...

from pprint import pprint
print("Likes:")
pprint(likes, indent=4)
print("Dislikes:")
pprint(dislikes, indent=4)

...or zip() them together to create a list of paired likes and dislikes!

print("A set of paired likes and dislikes")
pprint(list(zip(likes, dislikes)), indent=4)
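A note for Python 3 readers: there `zip()` returns a lazy iterator rather than a list, so wrapping it in `list()` is what actually produces the list of tuples the question asks for. A minimal sketch with shortened, illustrative data:

```python
# Shortened sample data, for illustration only.
likes = ["Meritocracy", "Healthy debates and collaboration"]
dislikes = ["Favoritism, ass-kissing, politics", "Ego-driven rhetoric"]

# list() materializes the iterator into a list of (like, dislike) tuples.
paired = list(zip(likes, dislikes))
first_like, first_dislike = paired[0]

print(paired)
```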

Full code:

likes_and_dislikes="""

+------------------------------------+-----------------------------------+
| likes                              | dislikes                          |
+------------------------------------+-----------------------------------+
| Meritocracy                        | Favoritism, ass-kissing, politics |
+------------------------------------+-----------------------------------+
| Healthy debates and collaboration  | Ego-driven rhetoric, drama and FUD|
|                                    | to get one's way                  |
+------------------------------------+-----------------------------------+
| Autonomy given by confident leaders| Micro-management by insecure      |
| capable of attracting top-tier     | managers compensating for a weak, |
| talent                             | immature team                     |
+------------------------------------+-----------------------------------+ """

import re
from pprint import pprint

likes, dislikes = [], []
pairs = re.split(r"\+-*\+-*\+\n?", likes_and_dislikes)[2:-1]  # Drop the header and the tail
for p in pairs:
  like, dislike = [], []
  for l in p.split('\n'):
    pair = l.split('|')
    if len(pair) > 1:
      # Not a blank line
      like.append(pair[1].strip())
      dislike.append(pair[2].strip())
  if len(like) > 0:
    likes.append(" ".join(like))
  if len(dislike) > 0:
    dislikes.append(" ".join(dislike))

print("Likes:")
pprint(likes, indent=4)
print("Dislikes:")
pprint(dislikes, indent=4)
print("A set of paired likes and dislikes")
pprint(list(zip(likes, dislikes)), indent=4)

This results in:

Likes:
[   'Meritocracy',
    'Healthy debates and collaboration ',
    'Autonomy given by confident leaders capable of attracting top-tier talent']
Dislikes:
[   'Favoritism, ass-kissing, politics',
    "Ego-driven rhetoric, drama and FUD to get one's way",
    'Micro-management by insecure managers compensating for a weak, immature team']
A set of paired likes and dislikes
[   ('Meritocracy', 'Favoritism, ass-kissing, politics'),
    (   'Healthy debates and collaboration ',
        "Ego-driven rhetoric, drama and FUD to get one's way"),
    (   'Autonomy given by confident leaders capable of attracting top-tier talent',
        'Micro-management by insecure managers compensating for a weak, immature team')]

You can see the complete code in action on codepad.
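Since the question asks for a list of tuples, the steps above can also be wrapped in a single function. The following is a minimal sketch rather than the answer's original code: the name `parse_likes_table` and the shortened `sample` string are assumptions for illustration, and unlike the code above it skips empty continuation cells so that no trailing spaces appear.

```python
import re

def parse_likes_table(table):
    """Return a list of (like, dislike) tuples from a two-column ASCII grid table."""
    # Split on border lines such as +-----+-----+ and drop the leading text,
    # the header row, and whatever trails the last border.
    chunks = re.split(r"\+-*\+-*\+\n?", table)[2:-1]
    result = []
    for chunk in chunks:
        like, dislike = [], []
        for line in chunk.split("\n"):
            cells = line.split("|")
            if len(cells) > 2:  # a real table line, not a blank remainder
                like.append(cells[1].strip())
                dislike.append(cells[2].strip())
        if like or dislike:
            result.append((
                " ".join(f for f in like if f),
                " ".join(f for f in dislike if f),
            ))
    return result

# Shortened copy of the question's table, for illustration only.
sample = """
+------------------------------------+-----------------------------------+
| likes                              | dislikes                          |
+------------------------------------+-----------------------------------+
| Meritocracy                        | Favoritism, ass-kissing, politics |
+------------------------------------+-----------------------------------+
| Healthy debates and collaboration  | Ego-driven rhetoric, drama and FUD|
|                                    | to get one's way                  |
+------------------------------------+-----------------------------------+
"""

result = parse_likes_table(sample)
for like, dislike in result:
    print(like, "/", dislike)
```

Like the original, this sketch assumes exactly two columns and that no cell contains a literal pipe character.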

Answer 2

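This is one of the table formats used in ReST (reStructuredText, a Pythonic form of markup), and there are various parsers that can handle it.

Here is one on the main python.org site: http://www.python.org/scripts/ht2html/docutils/parsers/rst/tableparser.py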
