I have a large CSV file containing data that I imported into Python with pandas. The first 3 lines of the CSV are shown below.
“PATIENT”,"MD",“REFMD”,“DIAGNOSIS_HISTORY”,“AVAILABLE_STUDIES”
“patient1\nPID1\npAge1”,“MDname1\nMDname3”,” RefDoctorName1”,“Prostate cancer”,”No Path\n CT ClinicName (CAP) - 11/30/2015\n Nuclear: ClinicName (Bone Scan) - 11/30/2015"
"patient2\nPID2\npAge2”,”MDname2\nSeen 10/12/2015”,“RefDoctorName2”,”Prostate cancer”,”Path: O/S - Prostate Bx 11/12/2014”
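I want to split the PATIENT column into "PATIENT_Name", "PID", and "pAGE". I have already separated out the individual columns, but I am stuck on the next step.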
import pandas as pd
f = pd.read_csv("/path/file.csv")
pat = f.iloc[0:,:1]
refmd = f.iloc[0:,2:3]
diag = f.iloc[0:,3:4]
Answer 0 (score: 0)
You can start by stripping the stray quote characters and punctuation from the column names and values:
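import re
# df is the DataFrame loaded with pd.read_csv (named f in the question)
df.columns = [re.sub(r'[^A-Za-z0-9\\]+', '', c).strip() for c in df.columns]
for i, col in df.items():
    # regex=True is needed in pandas 2.0+, where str.replace matches literally by default
    df.loc[:, i] = col.str.replace(r'[^A-Za-z0-9\\ ]+', '', regex=True).str.strip()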
This gives:
PATIENT MD REFMD \
0 patient1\nPID1\npAge1 MDname1\nMDname3 RefDoctorName1
1 patient2\nPID2\npAge2 MDname2\nSeen 10122015 RefDoctorName2
DIAGNOSISHISTORY AVAILABLESTUDIES
0 Prostate cancer No Path\n CT ClinicName CAP 11302015\n Nucle...
1 Prostate cancer Path OS Prostate Bx 11122014
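Note that the cleanup above removes every character that is not a letter, digit, space, or backslash, so it also strips the slashes from dates (10/12/2015 becomes 10122015 in the output). If only the stray quote characters need to go, a gentler variant might look like the sketch below. The sample values are hypothetical, modelled on the question's data, and the assumption is that straight and curly double quotes are the only debris.

```python
import pandas as pd

# Hypothetical sample values with stray curly quotes and a literal backslash-n.
s = pd.Series(['”MDname2\\nSeen 10/12/2015”', '“RefDoctorName2”'])

# Remove only straight and curly double quotes, then trim surrounding whitespace.
cleaned = s.str.replace(r'["“”]', '', regex=True).str.strip()
print(cleaned.tolist())
```

With this variant the date keeps its slashes, at the cost of leaving other punctuation (parentheses, hyphens, colons) in place.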
Then use split with expand=True to break each value at the newline marker and spread the pieces into new columns:
pat = df.iloc[:, 0].str.split(r'\\n', expand=True)
pat.columns = ['PATIENT_name', 'PID', 'pAGE']
PATIENT_name PID pAGE
0 patient1 PID1 pAge1
1 patient2 PID2 pAge2
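The pattern r'\\n' is a regular expression that matches a backslash followed by the letter n, which is what the cells contain when the file stores the two characters \n as text. If the cells held real newline characters instead, the separator would be '\n'. A minimal sketch of the difference, using made-up one-row Series (the names literal and real are just for illustration):

```python
import pandas as pd

# Backslash followed by "n": two ordinary characters stored in the cell.
literal = pd.Series([r'patient1\nPID1\npAge1'])
# Actual newline characters inside the cell.
real = pd.Series(['patient1\nPID1\npAge1'])

# Multi-character pattern: treated as a regex, matching a literal backslash + n.
a = literal.str.split(r'\\n', expand=True)
# Single-character pattern: treated as a plain separator.
b = real.str.split('\n', expand=True)

print(a)
print(b)
```

Both calls should yield one row with three columns; which one applies depends on how the CSV actually encodes the line breaks.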
and likewise for the MD column:
md = df.iloc[:, 1].str.split(r'\\n', expand=True)
md.columns = ['MD', 'MD2']
MD MD2
0 MDname1 MDname3
1 MDname2 Seen 10122015
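To get a single table again, the split pieces can be joined back to the untouched columns. The following is only a sketch under assumptions: the DataFrame is built by hand from the cleaned values shown above (with \n stored as literal text), only three of the five columns are included for brevity, and the name result is illustrative.

```python
import pandas as pd

# Hand-built stand-in for the cleaned DataFrame; \n is literal text here.
df = pd.DataFrame({
    'PATIENT': [r'patient1\nPID1\npAge1', r'patient2\nPID2\npAge2'],
    'MD': [r'MDname1\nMDname3', r'MDname2\nSeen 10122015'],
    'REFMD': ['RefDoctorName1', 'RefDoctorName2'],
})

# Split each multi-part column into its own small DataFrame.
pat = df['PATIENT'].str.split(r'\\n', expand=True)
pat.columns = ['PATIENT_name', 'PID', 'pAGE']
md = df['MD'].str.split(r'\\n', expand=True)
md.columns = ['MD', 'MD2']

# Put the new columns next to the remaining original columns.
result = pd.concat([pat, md, df.drop(columns=['PATIENT', 'MD'])], axis=1)
print(result)
```

Because pd.concat with axis=1 aligns on the index, this works as long as the pieces still share the original row index.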