Copy and paste parts of tabular report data to create a new document

Time: 2013-03-01 03:45:46

Tags: python copy copy-paste

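I have a set of tabulated, report-style data files to clean for my research. Instead of the typical tidy column-by-column layout, each file is laid out as a separate table per county (as shown below):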

1CURRENT DATE: XXX               AGE,SEX, RACE AND ETHNICITY OF PERSONS  PAGE    1
 BEGINNING DATE FOR DATA TOTALS: 01/83                    COUNTY    001
 ENDING DATE FOR DATA TOTALS: 12/83                                                                       RECORD COUNT    36
              Gender     Age_20    Age_21     Age_22   Age_23    Asian    Hispanic    White
Robbery       F           1          2          2        2         3         3          3
              M           3          3          2        2         4         3          3
Fraud         F           1          2          2        2         3         3          2
              M           2          3          2        2         4         3          3  
Arson         F           1          2          2        2         3         3          3
              M           4          3          2        2         4         3          4

1CURRENT DATE: XXX               AGE,SEX, RACE AND ETHNICITY OF PERSONS  PAGE    4
 BEGINNING DATE FOR DATA TOTALS: 01/83                    COUNTY    002
 ENDING DATE FOR DATA TOTALS: 12/83                                                                       RECORD COUNT    36
              Gender     Age_20    Age_21     Age_22   Age_23    Asian    Hispanic    White
Robbery       F           1          2          2        2         3         3          3
              M           2          3          2        2         4         4          3
Fraud         F           1          2          2        2         3         3          2
              M           2          3          2        2         4         6          3  
Arson         F           1          2          2        2         3         3          3
              M           4          3          2        2         4         3          4

1CURRENT DATE: XXX               AGE,SEX, RACE AND ETHNICITY OF PERSONS  PAGE    7
 BEGINNING DATE FOR DATA TOTALS: 01/83                    COUNTY    003
 ENDING DATE FOR DATA TOTALS: 12/83                                                                       RECORD COUNT    36
              Gender     Age_20    Age_21     Age_22   Age_23    Asian    Hispanic    White
Robbery       F           1          2          2        2         3         3          3
              M           3          3          2        2         4         3          3
Fraud         F           1          2          1        4         3         3          2
              M           2          3          2        2         4         3          3  
Arson         F           1          2          4        2         3         3          3
              M           4          3          2        2         4         3          4

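Because of this report-style layout, I cannot import these files directly into Excel or Stata for further analysis. What I intend to do is copy and paste each county's ID (i.e. COUNTY 003, COUNTY 002, etc.) together with a specific type of crime, to create a new column-oriented dataset: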

              Gender     Age_20    Age_21     Age_22   Age_23    Asian    Hispanic    White    County
Robbery       F           1          2          2        2         3         2          3        001
Robbery       F           1          2          2        2         2         3          3        002
Robbery       F           1          2          2        2         3         3          3        003

and then clean the data further in this new dataset.

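I searched online and found that Python can copy and paste specific parts of this kind of file into a new document. But I am very new to Python; my experience is mostly with Stata and SPSS. I don't know exactly what code would do this kind of copy-and-paste job.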

1 answer:

Answer 0 (score: 0)

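You might want to look at pandas. The specifics will vary with your format, but it doesn't take much to massage the data into something cleaner. There are prettier, less hard-coded ways to do the following, but here is an almost stream-of-consciousness example: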

import pandas as pd

# read in a fixed-width file; dtype=str keeps codes such as "001" as text
df = pd.read_fwf("crime.tsv", widths=[14] + [10]*8, header=None, dtype=str)
# clean up the strings (missing cells are NaN, so leave non-strings alone)
df = df.apply(lambda col: col.map(lambda x: x.strip() if isinstance(x, str) else x))

# make a new column holding the county code, set only on the "COUNTY" rows
df["County"] = df[6].where(df[5] == "COUNTY")
# fill the county info forwards into the empty places
df["County"] = df["County"].ffill()

# fill the crime information forwards
df[0] = df[0].ffill()

# reset the column names from one of the header rows (row 3 holds "Gender", "Age_20", ...)
df.columns = ["Crime"] + list(df.iloc[3, 1:-1]) + ["County"]
# get rid of any of the headings left in the table
df = df[df["Gender"] != "Gender"]

# toss anything which still has empty cells
df = df.dropna()

# reset the index, and fix the types
df = df.set_index(["Crime", "Gender", "County"]).astype(int)
df = df.reset_index()

which produces

>>> df
      Crime Gender County  Age_20  Age_21  Age_22  Age_23  Asian  Hispanic  White
0   Robbery      F    001       1       2       2       2      3         3      3
1   Robbery      M    001       3       3       2       2      4         3      3
2     Fraud      F    001       1       2       2       2      3         3      2
3     Fraud      M    001       2       3       2       2      4         3      3
4     Arson      F    001       1       2       2       2      3         3      3
5     Arson      M    001       4       3       2       2      4         3      4
6   Robbery      F    002       1       2       2       2      3         3      3
7   Robbery      M    002       2       3       2       2      4         4      3
8     Fraud      F    002       1       2       2       2      3         3      2
9     Fraud      M    002       2       3       2       2      4         6      3
10    Arson      F    002       1       2       2       2      3         3      3
11    Arson      M    002       4       3       2       2      4         3      4
12  Robbery      F    003       1       2       2       2      3         3      3
13  Robbery      M    003       3       3       2       2      4         3      3
14    Fraud      F    003       1       2       1       4      3         3      2
15    Fraud      M    003       2       3       2       2      4         3      3
16    Arson      F    003       1       2       4       2      3         3      3
17    Arson      M    003       4       3       2       2      4         3      4
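
To get the layout the question asked for (one crime and gender across all counties), a frame like the one above can be filtered with a boolean mask. This is only a minimal sketch: the small `tidy` frame is a hand-typed stand-in for a few rows of the output, and the helper name `select_rows` is made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the cleaned frame: a few rows and columns
# copied by hand from the output above.
tidy = pd.DataFrame(
    {
        "Crime": ["Robbery", "Robbery", "Fraud", "Robbery"],
        "Gender": ["F", "M", "F", "F"],
        "County": ["001", "001", "001", "002"],
        "Age_20": [1, 3, 1, 1],
        "Hispanic": [3, 3, 3, 3],
    }
)


def select_rows(frame, crime, gender):
    """Return the rows for one crime/gender pair, renumbered from 0."""
    mask = (frame["Crime"] == crime) & (frame["Gender"] == gender)
    return frame[mask].reset_index(drop=True)


robbery_f = select_rows(tidy, "Robbery", "F")

# to_csv() without a path returns the CSV as a string; passing a file name
# instead would write a file that Excel or Stata can import.
csv_text = robbery_f.to_csv(index=False)
```

Keeping `County` as text (rather than an integer) preserves the leading zeros in codes such as `001` when the CSV is written.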

After that we can do all sorts of neat things.
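
For anyone who would rather not install pandas, the same reshaping can be sketched with the standard library alone. This is a different technique from the answer above (splitting on whitespace instead of fixed widths), and it rests on assumptions worth stating: crime names are single words, gender codes are only `F` and `M`, and the county code is the token right after the word `COUNTY`. The `SAMPLE` text is an abbreviated version of the report from the question, and the function names are made up for illustration.

```python
import csv
import io

# Abbreviated report in the same layout as the question (fewer columns and rows).
SAMPLE = """\
1CURRENT DATE: XXX               AGE,SEX, RACE AND ETHNICITY OF PERSONS  PAGE    1
 BEGINNING DATE FOR DATA TOTALS: 01/83                    COUNTY    001
 ENDING DATE FOR DATA TOTALS: 12/83              RECORD COUNT    36
              Gender     Age_20    Age_21
Robbery       F           1          2
              M           3          3

1CURRENT DATE: XXX               AGE,SEX, RACE AND ETHNICITY OF PERSONS  PAGE    4
 BEGINNING DATE FOR DATA TOTALS: 01/83                    COUNTY    002
 ENDING DATE FOR DATA TOTALS: 12/83              RECORD COUNT    36
              Gender     Age_20    Age_21
Fraud         F           1          2
              M           2          3
"""


def parse_report(lines):
    """Return (header, rows) with one flat row per crime/gender/county."""
    header, rows = None, []
    county = crime = None
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue  # blank separator between county blocks
        if "COUNTY" in tokens[:-1]:
            # Assumption: the county code is the token right after "COUNTY".
            county = tokens[tokens.index("COUNTY") + 1]
        elif tokens[0] == "Gender":
            header = ["Crime"] + tokens + ["County"]
        elif header and tokens[0] in ("F", "M"):
            # Continuation row: reuse the crime named on the row above.
            rows.append([crime] + tokens + [county])
        elif header and len(tokens) == len(header) - 1 and tokens[1] in ("F", "M"):
            # Full data row: crime name, gender, then the counts.
            crime = tokens[0]
            rows.append(tokens + [county])
    return header, rows


def to_csv_text(header, rows):
    """Serialize the flat table as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


header, rows = parse_report(SAMPLE.splitlines())
csv_text = to_csv_text(header, rows)
```

Saving `csv_text` to a file gives something Excel or Stata can import directly. Splitting on whitespace is simpler than fixed widths but would misread multi-word crime names, which is why the fixed-width approach may be the safer choice for real files.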