I have the following data and can't think of a way to merge it in Python.
The data looks like this:
ID OFFSET TEXT
1 1 This text is short
2 1 This text is super long and got cut by the database s
2 2000 o it will come out like this
3 1 I'm short too
I have been trying to use csv.DictReader and csv.DictWriter.
Answer 0: (score: 0)
Use itertools.groupby.
Group by ID, then join the text:
import itertools
import operator
# dr is the csv.DictReader; groupby only merges consecutive rows that share an ID
for dbid, rows in itertools.groupby(dr, key=operator.itemgetter('ID')):
    print(dbid, ''.join(row['TEXT'] for row in rows))
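As a self-contained sketch of the same idea, assuming the data is available as comma-separated text (the `sample` string and the `merge_rows` helper are made up for illustration and are not part of the original answer). Because `itertools.groupby` only groups consecutive items, this relies on rows with the same ID being adjacent, as they are in the question. The merged records are then written back out with csv.DictWriter.

```python
import csv
import io
import itertools
import operator

# Hypothetical CSV rendering of the question's data (an assumption for this sketch).
sample = """ID,OFFSET,TEXT
1,1,This text is short
2,1,This text is super long and got cut by the database s
2,2000,o it will come out like this
3,1,I'm short too
"""

def merge_rows(rows):
    """Yield one {'ID', 'TEXT'} dict per run of consecutive rows sharing an ID."""
    for dbid, group in itertools.groupby(rows, key=operator.itemgetter('ID')):
        yield {'ID': dbid, 'TEXT': ''.join(row['TEXT'] for row in group)}

dr = csv.DictReader(io.StringIO(sample))
merged = list(merge_rows(dr))

# Write the merged records back out with csv.DictWriter.
out = io.StringIO()
dw = csv.DictWriter(out, fieldnames=['ID', 'TEXT'])
dw.writeheader()
dw.writerows(merged)
print(out.getvalue())
```

If the rows were not already grouped by ID, they would need to be sorted by ID before being passed to `groupby`.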
Answer 1: (score: 0)
groupby yields tuples in which the second element holds the TEXT items for each ID.
import pandas as pd
from io import StringIO
txt = """ID,OFFSET,TEXT
1, 1, This text is short
2, 1, This text is super long and got cut by the database s
2, 2000, o it will come out like this
3, 1, I'm short too
"""
f = StringIO(txt)
df = pd.read_table(f, sep=',')
df.set_index('ID', inplace=True)
# each my_tuple is (ID, Series of the TEXT values for that ID)
for my_tuple in df.groupby(df.index)['TEXT']:
    lst = [item.strip() for item in my_tuple[1]]
    print(". ".join(lst))
    print("\n")
Output:
This text is short
This text is super long and got cut by the database s. o it will come out like this
I'm short too
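Answer 2: (score: -1)
The csv.DictReader and csv.DictWriter classes are meant for CSV files. Although you could make them read a fixed-column file like the one you show, it isn't necessary and would likely complicate things.
Assuming the records are in order, Python can do all of this without any modules.
Here is a first approach: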
text = """
ID OFFSET TEXT
1 1 This text is short
2 1 This text is super long and got cut by the database s
2 2000 o it will come out like this
3 1 I'm short too
""".strip()
lines = text.splitlines()
columns = lines.pop(0)  # don't need the columns
result = dict()
for line in lines:
    # the maxsplit arg is important to keep all the text
    id, offset, text = line.split(maxsplit=2)
    if id in result:
        result[id] += text
    else:
        result[id] = text
print("Result:")
for id, text in result.items():
    print(f"ID {id} -> '{text}'")
This uses Python 3.6 f-strings, but you can get the same result without them if you prefer, for example:
...
print("ID %s -> '%s'" % (id, text))
Either way, the result is:
Result:
ID 1 -> 'This text is short'
ID 2 -> 'This text is super long and got cut by the database so it will come out like this'
ID 3 -> 'I'm short too'
The conditional check if id in result is "OK", but you can use defaultdict to avoid it:
from collections import defaultdict
result = defaultdict(str)
for line in lines:
    id, offset, text = line.split(maxsplit=2)
    result[id] += text  # <-- much better
print("Result:")
for id, text in result.items():
    print(f"ID {id} -> '{text}'")
The collections module has many handy utilities.
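One caveat: the approaches above rely on the fragments for each ID already appearing in OFFSET order. A minimal sketch of one way to drop that assumption, assuming OFFSET is always an integer (the `merge_by_offset` name and the deliberately shuffled sample are illustrative only):

```python
from collections import defaultdict

def merge_by_offset(lines):
    """Merge TEXT fragments per ID, ordering fragments by their numeric OFFSET."""
    fragments = defaultdict(list)
    for line in lines:
        # maxsplit=2 keeps the whole TEXT column, spaces included
        id_, offset, text = line.split(maxsplit=2)
        fragments[id_].append((int(offset), text))
    # sorting the (offset, text) tuples orders each ID's fragments by offset
    return {id_: ''.join(text for _, text in sorted(parts))
            for id_, parts in fragments.items()}

# Illustrative input: the two fragments for ID 2 are deliberately swapped.
shuffled = [
    "1 1 This text is short",
    "2 2000 o it will come out like this",
    "2 1 This text is super long and got cut by the database s",
    "3 1 I'm short too",
]
merged = merge_by_offset(shuffled)
print(merged["2"])
```

Collecting `(offset, text)` pairs in a `defaultdict(list)` and sorting at the end keeps the reading loop as simple as the `defaultdict(str)` version.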