Merging two nearly identical rows with csv.DictReader in Python 3

Asked: 2017-11-30 16:03:30

Tags: python python-3.x export-to-csv

I have the following data and can't think of a way to merge it in Python:

The data looks like this:

ID    OFFSET    TEXT
1     1         This text is short
2     1         This text is super long and got cut by the database s
2     2000      o it will come out like this
3     1         I'm short too

I have been trying to use csv.DictReader and csv.DictWriter.

3 answers:

Answer 0: (score: 0)

Use itertools.groupby to group the rows by ID, then join the text:

import itertools
import operator

# dr is the csv.DictReader; groupby only merges adjacent rows with the same ID
for dbid, rows in itertools.groupby(dr, key=operator.itemgetter('ID')):
    print(dbid, ''.join(row['TEXT'] for row in rows))
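
For completeness, here is a minimal sketch of the full round trip with csv.DictReader and csv.DictWriter, since those are what the question mentions. It assumes the data is available as a comma-separated file with the header ID,OFFSET,TEXT (the question shows whitespace-aligned columns, so that is an assumption), and that rows sharing an ID are already adjacent and in OFFSET order. The function name merge_rows and the in-memory StringIO files are only for illustration.

```python
import csv
import io
import itertools
import operator

# Assumed sample: the question's data rewritten as comma-separated text.
raw = """ID,OFFSET,TEXT
1,1,This text is short
2,1,This text is super long and got cut by the database s
2,2000,o it will come out like this
3,1,I'm short too
"""


def merge_rows(infile, outfile):
    """Join the TEXT of adjacent rows that share an ID and write ID,TEXT rows."""
    reader = csv.DictReader(infile)
    writer = csv.DictWriter(outfile, fieldnames=["ID", "TEXT"])
    writer.writeheader()
    # groupby only groups consecutive rows, so the input must be ordered by ID.
    for dbid, rows in itertools.groupby(reader, key=operator.itemgetter("ID")):
        writer.writerow({"ID": dbid, "TEXT": "".join(row["TEXT"] for row in rows)})


out = io.StringIO()
merge_rows(io.StringIO(raw), out)
merged = out.getvalue()
print(merged)
```

With real files you would pass file objects opened with newline="" instead of the StringIO objects, as the csv module documentation recommends.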

Answer 1: (score: 0)

Iterating over a pandas groupby yields (key, group) tuples; here each group holds the TEXT values for one ID.

txt="""ID,OFFSET,TEXT
1,     1,         This text is short
2,     1,         This text is super long and got cut by the database s
2,     2000,      o it will come out like this
3,     1,         I'm short too
"""

import pandas as pd
from io import StringIO

f = StringIO(txt)
df = pd.read_csv(f)

df.set_index('ID',inplace=True)


for my_tuple in df.groupby(df.index)['TEXT']:
    lst=[item.strip() for item in my_tuple[1]]
    print(". ".join(lst))
    print("\n")

Output:

This text is short

This text is super long and got cut by the database s. o it will come out like this

I'm short too

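Answer 2: (score: -1)

csv.DictReader and csv.DictWriter are meant for CSV files. You could make them read a fixed-width column file (like the one you show), but it isn't necessary and would probably complicate things.

Assuming the records are in order, all you need to do is:

  • Read each line (throwing away the first one)
  • Read the ID, offset and text (discarding the offset)
  • If the ID is new, store a mapping from the ID to the text
  • If the ID is not new, append the text.

Python can do all of this without importing any modules.

Here is a first approach: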

text="""
ID    OFFSET    TEXT
1     1         This text is short
2     1         This text is super long and got cut by the database s
2     2000      o it will come out like this
3     1         I'm short too
""".strip()

lines = text.splitlines()
columns = lines.pop(0)  # don't need the columns
result = dict()

for line in lines:
    # the maxsplit arg is important to keep all the text
    id, offset, text = line.split(maxsplit=2)
    if id in result:
        result[id] += text
    else:
        result[id] = text

print("Result:")
for id, text in result.items():
    print(f"ID {id} -> '{text}'")

This uses Python 3.6 f-strings, but if you prefer you can get the same result with, for example:

...
    print("ID %s -> '%s'" % (id, text))

Either way, the result is:

Result:
ID 1 -> 'This text is short'
ID 2 -> 'This text is super long and got cut by the database so it will come out like this'
ID 3 -> 'I'm short too'

The conditional check if id in result is "OK", but you can use defaultdict to avoid it:

from collections import defaultdict

result = defaultdict(str)
for line in lines:
    id, offset, text = line.split(maxsplit=2)
    result[id] += text  # <-- much better

print("Result:")
for id, text in result.items():
    print(f"ID {id} -> '{text}'")

The collections module has many handy utilities.
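
Both versions above rely on the records arriving in order. If that can't be guaranteed, a rough sketch along these lines keeps the offset instead of discarding it and sorts the chunks before joining. It assumes OFFSET is an integer that gives the position of each chunk within its record (which is how the sample data reads, but that is an assumption), and the name merge_chunks and the shuffled sample are made up for illustration.

```python
from collections import defaultdict


def merge_chunks(lines):
    """Merge text chunks per ID, ordering the chunks by their integer offset."""
    chunks = defaultdict(list)
    for line in lines:
        # maxsplit=2 keeps the whole text, spaces included, in the last field.
        row_id, offset, chunk = line.split(maxsplit=2)
        chunks[row_id].append((int(offset), chunk))
    return {
        row_id: "".join(chunk for _, chunk in sorted(parts, key=lambda p: p[0]))
        for row_id, parts in chunks.items()
    }


# Hypothetical input: the question's rows, deliberately out of order.
sample = [
    "2     2000      o it will come out like this",
    "1     1         This text is short",
    "2     1         This text is super long and got cut by the database s",
    "3     1         I'm short too",
]

merged = merge_chunks(sample)
for row_id, text in merged.items():
    print(f"ID {row_id} -> '{text}'")
```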