Question

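I am trying to set up a standard workflow to efficiently import data from the Dutch national statistics office (http://statline.cbs.nl) into R and/or Python using SPSS syntax, so that I can run analyses, load it into our database, and so on.

The good news is that they have standardized a number of output formats, including an .sps syntax file. In essence this is a space-delimited data file with extra information in the header and footer. The file looks as shown below. I prefer this format over plain .csv because it carries more data and makes it easier to import large amounts of data in a consistent way.

The bad news is that I cannot find a working library in Python and/or R that handles .sps SPSS syntax files. Most libraries work with the binary .sav or .por formats.

I am not looking for a full SPSS clone, just something that parses the data correctly using the metadata under the keywords 'DATA LIST' (the width of each column), 'VAR LABELS' (the column headers) and 'VALUE LABELS' (extra data that should be joined/substituted during import).

I am sure a Python/R library could be written to parse and process all this information efficiently, but I am not fluent or experienced enough in either language to do it myself.

Any suggestions or hints would be helpful.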

SET            DECIMAL = DOT.
TITLE          "Gezondheidsmonitor; regio, 2012, bevolking van 19 jaar of ouder".
DATA LIST      RECORDS = 1
 /1            Key0         1 -    5 (A)
               Key1         7 -    7 (A)
               Key2         9 -   14 (A)
               Key3        16 -   23 (A)
               Key4        25 -   28 (A)
               Key5        30 -   33 (A)
               Key6        35 -   38 (A)
               Key7        40 -   43 (A).

BEGIN DATA
80200 1 GM1680 2012JJ00 .    .    .    .   
80200 1 GM0738 2012JJ00 13.2 .    .    21.2
80200 1 GM0358 2012JJ00 .    .    .    .   
80200 1 GM0197 2012JJ00 13.7 .    .    10.8
80200 1 GM0059 2012JJ00 12.4 .    .    16.5
80200 1 GM0482 2012JJ00 13.3 .    .    14.1
80200 1 GM0613 2012JJ00 11.6 .    .    16.2
80200 1 GM0361 2012JJ00 17.0 9.6  17.1 14.9
80200 1 GM0141 2012JJ00 .    .    .    .   
80200 1 GM0034 2012JJ00 14.3 18.7 22.5 18.3
80200 1 GM0484 2012JJ00 9.7  .    .    15.5

(...)

80200 3 GM0642 2012JJ00 15.6 .    .    19.6
80200 3 GM0193 2012JJ00 .    .    .    .   
END DATA.
VAR LABELS
               Key0      "Leeftijd"/
               Key1      "Cijfersoort"/
               Key2      "Regio's"/
               Key3      "Perioden"/
               Key4      "Mantelzorger"/
               Key5      "Zwaar belaste mantelzorgers"/
               Key6      "Uren mantelzorg per week"/
               Key7      "Ernstig overgewicht".

VALUE LABELS
               Key0      "80200"  "65 jaar of ouder"/
               Key1      "1"  "Percentages"
                         "2"  "Ondergrens"
                         "3"  "Bovengrens"/
               Key2      "GM1680"  "Aa en Hunze"
                         "GM0738"  "Aalburg"
                         "GM0358"  "Aalsmeer"
                         "GM0197"  "Aalten"
                         (...)
                         "GM1896"  "Zwartewaterland"
                         "GM0642"  "Zwijndrecht"
                         "GM0193"  "Zwolle"/
               Key3      "2012JJ00"  "2012".

LIST           /CASES TO 10.

SAVE           /OUTFILE "Gezondheidsmonitor__regio,_2012,_bevolking_van_19_jaar_of_ouder.SAV".

Answer 1

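Some sample code to get you started. Sorry, I am not the best Python programmer here, so any improvements are welcome. The steps still to be added are loading the labels and building a list of dicts for the VALUE LABELS.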

import re

# Matches one column definition in DATA LIST, e.g. "Key0   1 -   5 (A)".
COLUMN_RE = re.compile(r'(\w+)\s+(\d+)\s*-\s*(\d+)\s*\((\w+)\)')

title = None
num_records = None
spss_keys = []        # one [name, first column, last column, format] per variable
data = []             # one list of raw string values per data row
in_data_list = False  # True between DATA LIST and BEGIN DATA
in_data = False       # True between BEGIN DATA and END DATA

with open('Bevolking_per_maand__100214211711.sps', 'r') as f:
    for line in f:
        line = line.rstrip('\n')

        if in_data:
            if line.startswith('END DATA'):
                in_data = False
            elif line.strip():
                # Slice the row by the 1-based column positions from DATA LIST;
                # padding spaces are stripped from each value.
                data.append([line[start - 1:end].strip()
                             for _, start, end, _ in spss_keys])
            continue

        if line.startswith('TITLE'):
            # The title is the text between the first pair of double quotes.
            start_pos = line.find('"') + 1
            end_pos = line.find('"', start_pos)
            title = line[start_pos:end_pos]
            print("title:", title)
        elif line.startswith('DATA LIST'):
            in_data_list = True
            num_records = line[line.find('=') + 1:].strip()
            print("number of records =", num_records)
        elif line.startswith('BEGIN DATA'):
            in_data_list = False
            in_data = True
        elif in_data_list and num_records == '1':
            m = COLUMN_RE.search(line)
            if m:
                spss_keys.append([m.group(1), int(m.group(2)),
                                  int(m.group(3)), m.group(4)])
        # VAR LABELS and VALUE LABELS are not parsed yet: more to follow

print(data)
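
The label steps that the code above leaves open could be sketched roughly as follows. This is only an illustration under some assumptions: the names parse_labels and decode_row are made up here, the sample text is a shortened copy of the file in the question, and labels are assumed not to contain embedded double quotes.

```python
import re

# Shortened sample in the same layout as the file shown in the question.
SAMPLE = '''\
VAR LABELS
               Key0      "Leeftijd"/
               Key1      "Cijfersoort"/
               Key4      "Mantelzorger".

VALUE LABELS
               Key0      "80200"  "65 jaar of ouder"/
               Key1      "1"  "Percentages"
                         "2"  "Ondergrens"
                         "3"  "Bovengrens"/
               Key3      "2012JJ00"  "2012".

LIST           /CASES TO 10.
'''

# Assumption: labels never contain an embedded double quote.
QUOTED = re.compile(r'"([^"]*)"')


def parse_labels(text):
    """Return (var_labels, value_labels) parsed from .sps syntax text."""
    var_labels = {}    # variable name -> column header
    value_labels = {}  # variable name -> {code: label}
    section = None     # the block we are currently inside, if any
    current = None     # variable whose value labels are being read

    for line in text.splitlines():
        stripped = line.strip()
        if stripped in ('VAR LABELS', 'VALUE LABELS'):
            section = stripped
            current = None
            continue
        if section is None or not stripped:
            continue

        quoted = QUOTED.findall(stripped)
        # Continuation lines start with a quote and carry no variable name.
        name = None if stripped.startswith('"') else stripped.split()[0]

        if section == 'VAR LABELS' and name and len(quoted) == 1:
            var_labels[name] = quoted[0]
        elif section == 'VALUE LABELS' and len(quoted) == 2:
            if name:
                current = name
            if current:
                value_labels.setdefault(current, {})[quoted[0]] = quoted[1]

        # A "." after the closing quote ends the current command.
        if stripped.endswith('.'):
            section = None

    return var_labels, value_labels


def decode_row(names, row, value_labels):
    """Replace each coded value by its label when one is defined."""
    return [value_labels.get(n, {}).get(v, v) for n, v in zip(names, row)]


var_labels, value_labels = parse_labels(SAMPLE)
print(var_labels)
print(value_labels['Key1'])
print(decode_row(['Key0', 'Key1', 'Key4'], ['80200', '1', '13.2'], value_labels))
```

Keeping the value labels as one dict per variable makes the substitution during import a plain lookup, and values without a label (such as the numeric columns) pass through unchanged.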

Answer 2

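From my point of view, I would not bother with the SPSS file option, but would pick the HTML version and scrape it. The tables appear to be well formatted with classes, which makes scraping/parsing the HTML easier.

Another question that needs answering: do you want to download the files manually, or do you want to automate this?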
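
For the HTML route, a minimal sketch using only the Python standard library might look like this. The HTML fragment is made up for illustration (the actual StatLine markup and class names will differ and would need to be checked), and TableParser is a hypothetical helper name; the cell values are taken from the sample data in the question.

```python
from html.parser import HTMLParser

# Made-up HTML fragment for illustration; the real StatLine markup will differ.
SAMPLE_HTML = '''
<table class="data">
  <tr><th>Regio's</th><th>Mantelzorger</th><th>Ernstig overgewicht</th></tr>
  <tr><td>Aalburg</td><td>13.2</td><td>21.2</td></tr>
  <tr><td>Aalten</td><td>13.7</td><td>10.8</td></tr>
</table>
'''


class TableParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped per <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag in ('td', 'th') and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ('td', 'th') and self._cell is not None:
            self._row.append(''.join(self._cell).strip())
            self._cell = None
        elif tag == 'tr' and self._row is not None:
            self.rows.append(self._row)
            self._row = None


parser = TableParser()
parser.feed(SAMPLE_HTML)
header, *records = parser.rows
print(header)
print(records)
```

If the download is to be automated, the same parser could be fed the text of a fetched page instead of the inline sample.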

Importing data using SPSS syntax: 'VALUE LABELS' and 'VAR LABELS'
