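I have a set of label-type data to clean for my research. Instead of the usual tidy column-by-column layout, each dataset is a label-style report with one block per county (as shown below):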
1CURRENT DATE: XXX AGE,SEX, RACE AND ETHNICITY OF PERSONS PAGE 1
BEGINNING DATE FOR DATA TOTALS: 01/83 COUNTY 001
ENDING DATE FOR DATA TOTALS: 12/83 RECORD COUNT 36
Gender Age_20 Age_21 Age_22 Age_23 Asian Hispanic White
Robbery F 1 2 2 2 3 3 3
M 3 3 2 2 4 3 3
Fraud F 1 2 2 2 3 3 2
M 2 3 2 2 4 3 3
Arson F 1 2 2 2 3 3 3
M 4 3 2 2 4 3 4
1CURRENT DATE: XXX AGE,SEX, RACE AND ETHNICITY OF PERSONS PAGE 4
BEGINNING DATE FOR DATA TOTALS: 01/83 COUNTY 002
ENDING DATE FOR DATA TOTALS: 12/83 RECORD COUNT 36
Gender Age_20 Age_21 Age_22 Age_23 Asian Hispanic White
Robbery F 1 2 2 2 3 3 3
M 2 3 2 2 4 4 3
Fraud F 1 2 2 2 3 3 2
M 2 3 2 2 4 6 3
Arson F 1 2 2 2 3 3 3
M 4 3 2 2 4 3 4
1CURRENT DATE: XXX AGE,SEX, RACE AND ETHNICITY OF PERSONS PAGE 7
BEGINNING DATE FOR DATA TOTALS: 01/83 COUNTY 003
ENDING DATE FOR DATA TOTALS: 12/83 RECORD COUNT 36
Gender Age_20 Age_21 Age_22 Age_23 Asian Hispanic White
Robbery F 1 2 2 2 3 3 3
M 3 3 2 2 4 3 3
Fraud F 1 2 1 4 3 3 2
M 2 3 2 2 4 3 3
Arson F 1 2 4 2 3 3 3
M 4 3 2 2 4 3 4
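Because of this label-style layout, I cannot import these datasets directly into Excel or Stata for further analysis. What I plan to do is copy and paste each county's ID (i.e. COUNTY 003, COUNTY 002, etc.) together with a specific type of crime to create a new column-style dataset: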
Gender Age_20 Age_21 Age_22 Age_23 Asian Hispanic White County
Robbery F 1 2 2 2 3 2 3 001
Robbery F 1 2 2 2 2 3 3 002
Robbery F 1 2 2 2 3 3 3 003
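and then clean the data further in this new dataset.
I searched online and found that Python can apparently copy and paste specific parts of a file like this into a new document. But I am new to Python; my experience is mostly with Stata and SPSS. I don't know exactly what code would do this kind of copy-and-paste job.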
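Answer 0 (score: 0)
You might want to look at pandas. The details will vary with your exact format, but it doesn't take much to massage the data into something cleaner. There are prettier, less hard-coded ways to do the following, but here's an almost stream-of-consciousness example: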
import pandas as pd
# read in a fixed-width file (the widths must match your actual file)
df = pd.read_fwf("crime.tsv", widths=[14] + [10]*8, header=None)
# clean up the strings (str replaces Python 2's basestring)
df = df.apply(lambda col: col.map(lambda x: x.strip() if isinstance(x, str) else x))
# make a new column holding the county code on the rows that contain "COUNTY"
df["County"] = df[6].where(df[5] == "COUNTY")
# fill the county info forwards into the empty places
df["County"] = df["County"].ffill()
# fill the crime information forwards
df[0] = df[0].ffill()
# reset the columns from one of the header rows (.iloc replaces the removed .ix)
df.columns = ["Crime"] + list(df.iloc[3, 1:-1]) + ["County"]
# get rid of any of the headings left in the table
df = df[df["Gender"] != "Gender"]
# toss anything which still has empty cells
df = df.dropna()
# reset the index, and fix the types
df = df.set_index(["Crime", "Gender", "County"]).astype(int)
df = df.reset_index()
which produces
>>> df
Crime Gender County Age_20 Age_21 Age_22 Age_23 Asian Hispanic White
0 Robbery F 001 1 2 2 2 3 3 3
1 Robbery M 001 3 3 2 2 4 3 3
2 Fraud F 001 1 2 2 2 3 3 2
3 Fraud M 001 2 3 2 2 4 3 3
4 Arson F 001 1 2 2 2 3 3 3
5 Arson M 001 4 3 2 2 4 3 4
6 Robbery F 002 1 2 2 2 3 3 3
7 Robbery M 002 2 3 2 2 4 4 3
8 Fraud F 002 1 2 2 2 3 3 2
9 Fraud M 002 2 3 2 2 4 6 3
10 Arson F 002 1 2 2 2 3 3 3
11 Arson M 002 4 3 2 2 4 3 4
12 Robbery F 003 1 2 2 2 3 3 3
13 Robbery M 003 3 3 2 2 4 3 3
14 Fraud F 003 1 2 1 4 3 3 2
15 Fraud M 003 2 3 2 2 4 3 3
16 Arson F 003 1 2 4 2 3 3 3
17 Arson M 003 4 3 2 2 4 3 4
after which we can do all sorts of neat things.
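If the column widths in your real file turn out not to be fixed, one of the less hard-coded alternatives is to walk the file line by line and split on whitespace. What follows is only a rough sketch, not a definitive implementation. It assumes that every crime name is a single word, that the gender is always F or M, and that each data row has exactly seven integer values. The names parse_report, SAMPLE and VALUE_COLUMNS are made up for illustration, and SAMPLE is just a shortened copy of the data in the question.

```python
import re

import pandas as pd

# Shortened copy of the report from the question (assumption: the real file looks like this).
SAMPLE = """\
1CURRENT DATE: XXX AGE,SEX, RACE AND ETHNICITY OF PERSONS PAGE 1
BEGINNING DATE FOR DATA TOTALS: 01/83 COUNTY 001
ENDING DATE FOR DATA TOTALS: 12/83 RECORD COUNT 36
Gender Age_20 Age_21 Age_22 Age_23 Asian Hispanic White
Robbery F 1 2 2 2 3 3 3
M 3 3 2 2 4 3 3
Fraud F 1 2 2 2 3 3 2
M 2 3 2 2 4 3 3
1CURRENT DATE: XXX AGE,SEX, RACE AND ETHNICITY OF PERSONS PAGE 4
BEGINNING DATE FOR DATA TOTALS: 01/83 COUNTY 002
ENDING DATE FOR DATA TOTALS: 12/83 RECORD COUNT 36
Gender Age_20 Age_21 Age_22 Age_23 Asian Hispanic White
Robbery F 1 2 2 2 3 3 3
M 2 3 2 2 4 4 3
"""

VALUE_COLUMNS = ["Age_20", "Age_21", "Age_22", "Age_23", "Asian", "Hispanic", "White"]


def parse_report(lines):
    """Turn the per-county report lines into one tidy DataFrame."""
    rows = []
    county = None  # most recent county code seen in a page header
    crime = None   # most recent crime name seen in a data row
    for line in lines:
        # Page header line carrying the county code, e.g. "... COUNTY 001".
        match = re.search(r"\bCOUNTY\s+(\d+)", line)
        if match:
            county = match.group(1)
            crime = None
            continue
        tokens = line.split()
        # A row that starts with a crime name: remember it, then drop it.
        if len(tokens) == len(VALUE_COLUMNS) + 2 and tokens[1] in ("F", "M"):
            crime = tokens[0]
            tokens = tokens[1:]
        # A data row is now: gender followed by the integer counts.
        if (
            len(tokens) == len(VALUE_COLUMNS) + 1
            and tokens[0] in ("F", "M")
            and all(t.isdigit() for t in tokens[1:])
        ):
            rows.append([crime, tokens[0], county] + [int(t) for t in tokens[1:]])
    return pd.DataFrame(rows, columns=["Crime", "Gender", "County"] + VALUE_COLUMNS)


df = parse_report(SAMPLE.splitlines())
# Keep only the rows asked about in the question: female robbery counts per county.
robbery_f = df[(df["Crime"] == "Robbery") & (df["Gender"] == "F")]
print(robbery_f)
```

For a real file you would pass an open file object instead, e.g. parse_report(open("crime.tsv")), and something like robbery_f.to_csv("robbery_female.csv", index=False) (a hypothetical file name) gives you a plain CSV that Stata or Excel can import. County codes stay as strings, so the leading zeros in "001" survive.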