Suppose I have a large text file in the following format:
[Surname: "Gordon"]
[Name: "James"]
[Age: "13"]
[Weight: "46"]
[Height: "12"]
[Quote: "I want to be a pilot"]
[Name: "Monica"]
[Weight: "33"]
[Quote: "I am looking forward to christmas"]
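There are 8 keys in total, and they always appear in the order "Surname", "Name", "Age", "Weight", "Height", "School", "Siblings", "Quote", which I know in advance. As you can see, some profiles do not have the full set of variables. The only key guaranteed to be present is Name.
I want to create a pandas DataFrame with each observation as a row and each key as a column. For James, since he has no entries for "School" and "Siblings", I would like those cells to be numpy nan objects.
My attempt was to use something like (?:\[Surname: \"()\"\]) for each variable. However, even for the single case of Surname I run into problems: if the surname is absent, I just get an empty list back, with no placeholder.
Update:
As an example, I would like Monica's profile to be returned as ('','Monica','','33','','','','I am looking forward to christmas')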
Answer 0 (score: 1)
You can parse the file data, group the results, and pass them to a DataFrame:
import re
import pandas as pd

def group_results(d):
    # Collect consecutive [key, value] pairs into one group per person.
    _group = [d[0]]
    for a, b in d[1:]:
        if a == 'Name' and not any(c == 'Name' for c, _ in _group):
            _group.append([a, b])
        elif a == 'Surname' and any(c == 'Name' for c, _ in _group):
            # A Surname after a Name starts a new person.
            yield _group
            _group = [[a, b]]
        else:
            if a == 'Name':
                # A second Name also starts a new person.
                yield _group
                _group = [[a, b]]
            else:
                _group.append([a, b])
    yield _group

headers = ["Surname","Name","Age","Weight","Height","School","Siblings","Quote"]
# Read the non-empty lines of the file.
with open('filename.txt') as f:
    data = list(filter(None, [i.strip('\n') for i in f]))
# Extract the key and the quoted value (quotes removed) from each line.
parsed = [(lambda x:[x[0], x[-1][1:-1]])(re.findall(r'(?<=^\[)\w+|".*?"(?=\]$)', i)) for i in data]
_grouped = list(map(dict, group_results(parsed)))
# Missing keys are filled with an empty string.
result = pd.DataFrame([[c.get(i, "") for i in headers] for c in _grouped], columns=headers)
Output:
Surname Name ... Siblings Quote
0 Gordon James ... I want to be a pilot
1 Monica ... I am looking forward to christmas
[2 rows x 8 columns]
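The question asked for NaN rather than empty strings in the missing cells. A minimal sketch of one way to get that, assuming the grouped records are already available as a list of dictionaries (the `grouped` list below is hand-written sample data standing in for `_grouped`): passing the dictionaries straight to `pd.DataFrame` with `columns=headers` leaves absent keys as NaN.

```python
import pandas as pd

headers = ["Surname", "Name", "Age", "Weight", "Height", "School", "Siblings", "Quote"]

# Hand-written stand-in for the grouped records (assumption: one dict per person).
grouped = [
    {"Surname": "Gordon", "Name": "James", "Age": "13", "Weight": "46",
     "Height": "12", "Quote": "I want to be a pilot"},
    {"Name": "Monica", "Weight": "33", "Quote": "I am looking forward to christmas"},
]

# Keys missing from a dict become NaN; columns= fixes the column order
# and also creates the columns that no record contains.
df = pd.DataFrame(grouped, columns=headers)
print(df.isna().sum())
```

The values that are present are still strings here; converting Age, Weight and Height to numbers would be a separate step.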
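Answer 1 (score: 0)
Based on @WiktorStribiżew's comment, you can use groupby (from itertools) to split the lines into runs of empty lines and runs of data lines, for example: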
import re
from itertools import groupby

# Records are separated by blank lines.
text = '''[Surname: "Gordon"]
[Name: "James"]
[Age: "13"]
[Weight: "46"]
[Height: "12"]
[Quote: "I want to be a pilot"]

[Name: "Monica"]
[Weight: "33"]
[Quote: "I am looking forward to christmas"]

[Name: "John"]
[Height: "33"]
[Quote: "I am looking forward to christmas"]

[Surname: "Gordon"]
[Name: "James"]
[Height: "44"]
[Quote: "I am looking forward to christmas"]'''

# One pattern per field; the named group becomes the dictionary key.
patterns = [re.compile(r'(\[Surname: "(?P<surname>\w+?)"\])'),
            re.compile(r'(\[Name: "(?P<name>\w+?)"\])'),
            re.compile(r'(\[Age: "(?P<age>\d+?)"\])'),
            re.compile(r'\[Weight: "(?P<weight>\d+?)"\]'),
            re.compile(r'\[Height: "(?P<height>\d+?)"\]'),
            re.compile(r'\[Quote: "(?P<quote>.+?)"\]')]

records = []
for non_empty, group in groupby(text.splitlines(), key=lambda l: bool(l.strip())):
    if non_empty:
        lines = list(group)
        record = {}
        for line in lines:
            for pattern in patterns:
                match = pattern.search(line)
                if match:
                    record.update(match.groupdict())
                    break
        records.append(record)

for record in records:
    print(record)
Output
{'weight': '46', 'quote': 'I want to be a pilot', 'age': '13', 'name': 'James', 'height': '12', 'surname': 'Gordon'}
{'weight': '33', 'quote': 'I am looking forward to christmas', 'name': 'Monica'}
{'height': '33', 'quote': 'I am looking forward to christmas', 'name': 'John'}
{'height': '44', 'surname': 'Gordon', 'quote': 'I am looking forward to christmas', 'name': 'James'}
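Note: this creates one dictionary per record, where the keys are the field names and the values are the field values. This format does not match your expected output, but I believe it is more complete than what you asked for. In any case, you can easily convert from this format to the tuple format you want.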
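The conversion mentioned in the note could look like the following sketch. The `headers` list is an assumption: it uses the lowercase group names from the patterns and adds hypothetical "school" and "siblings" keys, which the patterns above do not define, to reach the eight positions the question asks for.

```python
# Field order for the output tuples (assumption: "school" and "siblings"
# are hypothetical keys added only to fill the eight expected positions).
headers = ["surname", "name", "age", "weight", "height", "school", "siblings", "quote"]

# One record in the dictionary format shown above.
records = [
    {"weight": "33", "quote": "I am looking forward to christmas", "name": "Monica"},
]

# Missing keys fall back to an empty string placeholder.
rows = [tuple(record.get(key, "") for key in headers) for record in records]
print(rows[0])
# ('', 'Monica', '', '33', '', '', '', 'I am looking forward to christmas')
```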
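Explanation
The groupby function from itertools splits the input into consecutive groups of empty lines and record lines, so you only need to process the groups that are not empty. Processing each line is simple: try every pattern against it and break as soon as one matches (this assumes each line matches exactly one pattern), then use the named group to update the record dictionary with the field's value.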
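To show the grouping step in isolation, here is a small sketch with made-up lines (the `lines` list is illustrative, not taken from the data file). Consecutive lines that share the same key value end up in the same group, so runs of blank lines act as separators.

```python
from itertools import groupby

# Illustrative input: "" marks a blank separator line.
lines = ["a", "b", "", "c", "", "", "d"]

# The key is True for non-blank lines and False for blank ones, so
# groupby yields alternating runs of data lines and blank lines.
blocks = [list(group)
          for non_empty, group in groupby(lines, key=lambda l: bool(l.strip()))
          if non_empty]
print(blocks)
# [['a', 'b'], ['c'], ['d']]
```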
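Answer 2 (score: 0)
You can rewrite the data file. The code parses the original file into instances of a class D, then uses csv.DictWriter to write them out as a conventional CSV file that pandas should be able to read:
Create the demo file: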
fn = "t.txt"
with open(fn, "w") as f:
    f.write("""
[Surname: "Gordon"]
[Name: "James"]
[Age: "13"]
[Weight: "46"]
[Height: "12"]
[Quote: "I want to be a pilot"]
[Name: "Monica"]
[Weight: "33"]
[Quote: "I am looking forward to christmas"]
""")
Intermediate class:
class D:
    fields = ["Surname","Name","Age","Weight","Height","Quote"]

    def __init__(self, textlines):
        # Split each "Key: value" line once at the first colon.
        t = [(k.strip(), v.strip()) for k, v in (x.strip().split(":", 1) for x in textlines)]
        # Start with every field empty, then fill in what was found.
        self.data = {k: "" for k in D.fields}
        self.data.update(t)

    def surname(self): return self.data["Surname"]
    def name(self): return self.data["Name"]
    def age(self): return self.data["Age"]
    def weight(self): return self.data["Weight"]
    def height(self): return self.data["Height"]
    def quote(self): return self.data["Quote"]

    def get_data(self):
        return self.data
Parse and rewrite:
fn = "t.txt"
# list of all collected D-Instances
data = []
with open(fn) as f:
    # each dataset contains all lines belonging to one "person"
    dataset = []
    surname = False
    for line in f.readlines():
        clean = line.strip().strip("[]")
        if clean and (clean.startswith("Surname") or clean.startswith("Name")):
            if any(e.startswith("Name") for e in dataset):
                data.append(D(dataset))
                dataset = []
                if clean:
                    dataset.append(clean)
            else:
                if clean:
                    dataset.append(clean)
        elif clean:
            dataset.append(clean)
    if dataset:
        data.append(D(dataset))

import csv
with open("other.txt", "w", newline="") as f:
    dw = csv.DictWriter(f, fieldnames=D.fields)
    dw.writeheader()
    for entry in data:
        dw.writerow(entry.get_data())
Check what was written:
with open("other.txt","r") as f:
    print(f.read())
Output:
Surname,Name,Age,Weight,Height,Quote
"""Gordon""","""James""","""13""","""46""","""12""","""I want to be a pilot"""
,"""Monica""",,"""33""",,"""I am looking forward to christmas"""
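As a rough sketch of the final step, the CSV shown above can be loaded with `pandas.read_csv`. The text is inlined through `io.StringIO` here so the example does not depend on other.txt existing. Because the values were written with their surrounding double quotes, stripping them afterwards is an extra cleanup step that this sketch assumes you want.

```python
import io
import pandas as pd

# The CSV content from the output above, inlined for a self-contained example.
csv_text = '''Surname,Name,Age,Weight,Height,Quote
"""Gordon""","""James""","""13""","""46""","""12""","""I want to be a pilot"""
,"""Monica""",,"""33""",,"""I am looking forward to christmas"""
'''

# dtype=str keeps every column as text; empty fields are read as NaN.
df = pd.read_csv(io.StringIO(csv_text), dtype=str)

# Remove the literal double quotes that were stored inside each value.
df = df.apply(lambda col: col.str.strip('"'))
print(df)
```

With a real file, `io.StringIO(csv_text)` would be replaced by the file name.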
Answer 3 (score: 0)
Use re.findall() to build a list of (key, value) tuples and put the tuples of each block of information into a separate dictionary:
import re

text="""[Surname: "Gordon"]
[Name: "James"]
[Age: "13"]
[Weight: "46"]
[Height: "12"]
[Quote: "I want to be a pilot"]
[Name: "Monica"]
[Weight: "33"]
[Quote: "I am looking forward to christmas"]"""
keys=['Surname','Name','Age','Weight','Height','Quote']
rslt=[{}]
for k,v in re.findall(r"(?m)(?:^\s*\[(\w+):\s*\"\s*([^\]\"]+)\"\s*\])+",text):
    d=rslt[-1]
    # Start a new dictionary on a Surname (if the current one is not empty)
    # or on a Name when the current dictionary already has one.
    if (k=="Surname" and d) or (k=="Name" and "Name" in d):
        d={}
        rslt.append(d)
    d[k]=v
for d in rslt:
    print( [d.get(k,'') for k in keys] )
Out:
['Gordon', 'James', '13', '46', '12', 'I want to be a pilot']
['', 'Monica', '', '33', '', 'I am looking forward to christmas']