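I'm writing a script that parses a text file and normalizes it so that it can be inserted into a database. The data represents articles written by one or more authors. The problem is that, because the number of authors is not fixed, my output text file ends up with a variable number of columns. For example: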
author1, author2, author3, this is the title of the article
author1, author2, this is the title of the article
author1, author2, author3, author4, this is the title of the article
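These rows give a maximum of 5 columns, so for the first two articles I need to add blank columns so that every row in the output has the same number of columns. What is the best way to do this? My input text is tab-delimited, and I can easily iterate over the fields by splitting on tabs.
Answer 0 (score: 2)
Assuming you already have the maximum column count and have already split each line into fields (I'm assuming each row is in its own list), you should be able to use list.insert(-1, item) to add the empty columns. Inserting at index -1 places the new item just before the last element, so the title stays in the final column: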
def columnize(mylists, maxcolumns):
    # Pad each row in place; insert(-1, None) adds a blank just before the title.
    for i in mylists:
        while len(i) < maxcolumns:
            i.insert(-1, None)

mylists = [["author1", "author2", "author3", "this is the title of the article"],
           ["author1", "author2", "this is the title of the article"],
           ["author1", "author2", "author3", "author4", "this is the title of the article"]]
columnize(mylists, 5)
print(mylists)
[['author1', 'author2', 'author3', None, 'this is the title of the article'], ['author1', 'author2', None, None, 'this is the title of the article'], ['author1', 'author2', 'author3', 'author4', 'this is the title of the article']]
An alternative version that uses a list comprehension and does not modify the original lists:
def columnize(mylists, maxcolumns):
    # Build new rows: the authors, then None padding, then the title.
    return [j[:-1] + [None] * (maxcolumns - len(j)) + j[-1:] for j in mylists]

print(columnize(mylists, 5))
[['author1', 'author2', 'author3', None, 'this is the title of the article'], ['author1', 'author2', None, None, 'this is the title of the article'], ['author1', 'author2', 'author3', 'author4', 'this is the title of the article']]
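Both versions above take the maximum column count as given. As a rough sketch of how that count could be derived from the data itself, assuming tab-delimited lines as described in the question (the `pad_rows` helper and the `raw_lines` sample are hypothetical names used only for illustration):

```python
def pad_rows(raw_lines, delimiter="\t"):
    """Split each line and pad the author columns so all rows have equal width."""
    # Skip blank lines; strip the trailing newline before splitting.
    rows = [line.rstrip("\r\n").split(delimiter) for line in raw_lines if line.strip()]
    if not rows:
        return []
    # The widest row determines the column count for every row.
    width = max(len(row) for row in rows)
    # Keep the title last: authors, then None placeholders, then the title.
    return [row[:-1] + [None] * (width - len(row)) + row[-1:] for row in rows]


# Hypothetical sample input standing in for the lines of the real file.
raw_lines = [
    "author1\tauthor2\tauthor3\tthis is the title of the article\n",
    "author1\tauthor2\tthis is the title of the article\n",
]
padded = pad_rows(raw_lines)
for row in padded:
    print(row)
```

This follows the list-comprehension approach, so the input lines are left untouched and a new list of rows is returned.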
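Answer 1 (score: 1)
Forgive me if I've misunderstood, but it sounds like you're approaching the problem the hard way. It's quite easy to convert the text file into a dictionary that maps each title to its list of authors: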
>>> lines = ["auth1, auth2, auth3, article1", "auth1, auth2, article2", "auth1, article3"]
>>> d = dict((x[-1], x[:-1]) for x in [line.split(', ') for line in lines])
>>> d
{'article1': ['auth1', 'auth2', 'auth3'], 'article2': ['auth1', 'auth2'], 'article3': ['auth1']}
>>> total_articles = len(d)
>>> total_articles
3
>>> max_authors = max(len(val) for val in d.values())
>>> max_authors
3
>>> for k, v in d.items():
...     print(k)
...     print(v + [None] * (max_authors - len(v)))
...
article1
['auth1', 'auth2', 'auth3']
article2
['auth1', 'auth2', None]
article3
['auth1', None, None]
Then, if you really want to, you can write this data out using the csv module built into Python. Alternatively, you could directly output the SQL you will need.
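As a minimal sketch of the csv route, assuming a title-to-authors dictionary like `d` above: the `write_padded_csv` helper, the header names, and the choice of empty strings for missing authors are all illustrative assumptions, not something the question prescribes.

```python
import csv
import io


def write_padded_csv(articles, out):
    """Write one row per article: the title, then authors padded to equal width."""
    max_authors = max((len(authors) for authors in articles.values()), default=0)
    writer = csv.writer(out, lineterminator="\n")
    # Header names are hypothetical; adjust them to match the target table.
    writer.writerow(["title"] + ["author%d" % (i + 1) for i in range(max_authors)])
    for title, authors in articles.items():
        # Empty strings stand in for the missing authors.
        writer.writerow([title] + authors + [""] * (max_authors - len(authors)))


articles = {
    "article1": ["auth1", "auth2", "auth3"],
    "article2": ["auth1", "auth2"],
    "article3": ["auth1"],
}
buffer = io.StringIO()  # an in-memory stand-in for a real output file
write_padded_csv(articles, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

When writing to a real file instead of `io.StringIO`, the csv documentation recommends opening it with `newline=""`.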
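You are opening the same file several times, and reading it several times, just to obtain counts that can be derived from the data already in memory. Please don't read the file more than once for that purpose.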
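A sketch of the single-pass idea, assuming a tab-delimited file with the title in the last field: the `load_articles` helper and the temporary sample file are hypothetical and exist only to make the example self-contained.

```python
import os
import tempfile


def load_articles(path):
    """Read the file exactly once and return a list of (authors, title) pairs."""
    articles = []
    with open(path) as handle:
        for line in handle:
            fields = line.rstrip("\r\n").split("\t")
            if len(fields) < 2:
                continue  # skip blank lines and rows with no author
            articles.append((fields[:-1], fields[-1]))
    return articles


# Create a small sample file so the example can run on its own.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("auth1\tauth2\tauth3\tarticle1\n")
    tmp.write("auth1\tauth2\tarticle2\n")
    tmp.write("auth1\tarticle3\n")
    sample_path = tmp.name

articles = load_articles(sample_path)
os.remove(sample_path)

# Both counts come from the in-memory list, not from re-reading the file.
total_articles = len(articles)
max_authors = max(len(authors) for authors, _ in articles)
print(total_articles, max_authors)
```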