Question

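I'm writing a script that parses a text file and normalizes it so that it can be inserted into a database. The data represents articles written by one or more authors. The problem is that, because the number of authors isn't fixed, I end up with a variable number of columns in my output text file. For example: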

author1, author2, author3, this is the title of the article
author1, author2, this is the title of the article
author1, author2, author3, author4, this is the title of the article

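These results give a maximum of 5 columns, so for the first two articles I need to add blank columns so that every row in the output has the same number of columns. What's the best way to do this? My input text is tab-delimited, and I can easily iterate over the fields by splitting on tabs.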

Answer 1

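Assuming you already have the maximum number of columns and have already split each line into a list (I'm assuming you keep each one in its own list), you should be able to use list.insert(-1, item) to add the empty columns. Inserting at index -1 places the new item just before the last element, so the title stays in the final column: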

def columnize(mylists, maxcolumns):
    for i in mylists:
        while len(i) < maxcolumns:
            i.insert(-1,None)

mylists = [["author1","author2","author3","this is the title of the article"],
           ["author1","author2","this is the title of the article"],
           ["author1","author2","author3","author4","this is the title of the article"]]

columnize(mylists,5)
print(mylists)

[['author1', 'author2', 'author3', None, 'this is the title of the article'], ['author1', 'author2', None, None, 'this is the title of the article'], ['author1', 'author2', 'author3', 'author4', 'this is the title of the article']]

An alternative version that uses a list comprehension and leaves the original lists unmodified:

def columnize(mylists, maxcolumns):
    return [j[:-1]+([None]*(maxcolumns-len(j)))+j[-1:] for j in mylists]

print(columnize(mylists, 5))

[['author1', 'author2', 'author3', None, 'this is the title of the article'], ['author1', 'author2', None, None, 'this is the title of the article'], ['author1', 'author2', 'author3', 'author4', 'this is the title of the article']]
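
The hard-coded 5 can also be derived from the data itself. What follows is only a minimal sketch, assuming the rows are already split into lists with the title as the last item; the helper name pad_to_widest is hypothetical and not part of the answer above:

```python
def pad_to_widest(rows):
    """Return new rows padded with None just before the last item (the title)."""
    if not rows:
        return []
    # The widest row determines how many columns every row needs.
    width = max(len(row) for row in rows)
    # Slicing builds new lists, so the input rows are left untouched.
    return [row[:-1] + [None] * (width - len(row)) + row[-1:] for row in rows]


rows = [
    ["author1", "author2", "author3", "this is the title of the article"],
    ["author1", "author2", "this is the title of the article"],
    ["author1", "author2", "author3", "author4", "this is the title of the article"],
]

padded = pad_to_widest(rows)
for row in padded:
    print(row)
```

Computing the width inside the helper means the caller never has to keep a separate column count in sync with the data.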

Answer 2

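Forgive me if I've misunderstood, but it sounds like you're solving this the hard way. It's very easy to turn your text file into a dictionary that maps each title to its list of authors: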

>>> lines = ["auth1, auth2, auth3, article1", "auth1, auth2, article2", "auth1, article3"]
>>> d = dict((x[-1], x[:-1]) for x in [line.split(', ') for line in lines])
>>> d
{'article1': ['auth1', 'auth2', 'auth3'], 'article2': ['auth1', 'auth2'], 'article3': ['auth1']}
>>> total_articles = len(d)
>>> total_articles
3
>>> max_authors = max(len(val) for val in d.values())
>>> max_authors
3
>>> for k, v in d.items():
...     print(k)
...     print(v + [None] * (max_authors - len(v)))
...
article1
['auth1', 'auth2', 'auth3']
article2
['auth1', 'auth2', None]
article3
['auth1', None, None]

Then, if you really want to, you can write this data out using the csv module built into Python. Alternatively, you could directly output the SQL you'll need.
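
As a rough illustration of the csv route, here is a sketch that assumes Python 3, a dictionary shaped like d above, and a column layout of authors first and title last (that layout is my assumption, chosen to match the question's examples). An io.StringIO buffer stands in for a real output file:

```python
import csv
import io

d = {
    "article1": ["auth1", "auth2", "auth3"],
    "article2": ["auth1", "auth2"],
    "article3": ["auth1"],
}
max_authors = max(len(authors) for authors in d.values())

# io.StringIO stands in for a real file here; with a real file, use
# open("articles.csv", "w", newline="") instead (the file name is hypothetical).
out = io.StringIO()
writer = csv.writer(out)
for title, authors in d.items():
    # csv.writer writes None as an empty field, so padding with None
    # produces blank columns.
    writer.writerow(authors + [None] * (max_authors - len(authors)) + [title])

print(out.getvalue())
```

For tab-delimited output, csv.writer accepts delimiter="\t".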

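You are opening the same file several times, and reading it several times, just to get counts that you could get from the data you already have in memory. Please don't read the file more than once for that purpose.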
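
One way to read the input only once is sketched below. It assumes tab-delimited lines as described in the question; load_rows is a hypothetical helper, and an io.StringIO buffer stands in for the open file:

```python
import io


def load_rows(source):
    """Read tab-delimited lines once and return them as lists of fields."""
    # Blank lines are skipped; the trailing newline is stripped before splitting.
    return [line.rstrip("\r\n").split("\t") for line in source if line.strip()]


# io.StringIO stands in for an open file; with a real file this would be
# `with open(path) as source: rows = load_rows(source)`.
sample = io.StringIO(
    "auth1\tauth2\tauth3\tarticle1\n"
    "auth1\tauth2\tarticle2\n"
    "auth1\tarticle3\n"
)
rows = load_rows(sample)

# Both counts come from the in-memory list; the file is not read again.
total_articles = len(rows)
max_columns = max(len(row) for row in rows)
print(total_articles, max_columns)
```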
