Question

I have a DataFrame df that contains many field names for a range of years.

                                                   field
year description                                               
1993 bar0                                       a01arb92
     bar1                                       a01svb92
     bar2                                       a01fam92
     bar3                                       a08
     bar4                                       a01bea93

Then, for each year, I have a Stata file with an id column plus further columns for some (or all) of the field names listed in df. For example, 1993.dta might be

id a01arb92 a01svb92 a08 a01bea93
0         1        1   1        1
0         1        1   1        2

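For each year, I need to check whether all the fields listed in df actually exist (as columns) in the corresponding file. I then want to save the result back into the original DataFrame. Is there a nice way to do this without iterating over every field?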

Expected output:

                                                   field   exists
year description                                               
1993 bar0                                       a01arb92        1
     bar1                                       a01svb92        1
     bar2                                       a01fam92        0
     bar3                                       a08             1
     bar4                                       a01bea93        1

This is the result if, for example, every field except a01fam92 exists as a column in the 1993 file.

Answer 1

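Try doing it one year at a time: filter the DataFrame to get the fields associated with each specific year, then check whether those elements are in the Stata file or not.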

Use read_stata:

 import pandas as pd
 d = pd.read_stata("file")  # read_stata lives directly in the pandas namespace
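
If only the column names are needed, loading every row of each yearly file may be unnecessary. The following is a minimal sketch, assuming a reasonably recent pandas in which read_stata(..., iterator=True) returns a reader that can be used as a context manager; the helper name stata_column_names and the throwaway demo file are made up for illustration.

```python
import os
import tempfile

import pandas as pd


def stata_column_names(path):
    """Return the variable (column) names of a Stata file without reading its rows."""
    # iterator=True returns a StataReader instead of a DataFrame.
    with pd.read_stata(path, iterator=True) as reader:
        # variable_labels() maps each variable name to its label;
        # only the keys (the names) are needed here.
        return list(reader.variable_labels())


# Demo on a throwaway file standing in for 1993.dta (hypothetical data).
demo = pd.DataFrame({"id": [0, 0], "a01arb92": [1, 1], "a08": [1, 1]})
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "1993.dta")
    demo.to_stata(path, write_index=False)
    columns = stata_column_names(path)
```

The resulting list can be used anywhere the answer uses the columns of d.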

Read your CSV file and store it in a DataFrame:

 import pandas as pd
 df= pd.read_csv("file")

Filter and extract the fields for each year.

df[df["year"]==1993]["field"]  # Output: the fields listed for year 1993

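You can consolidate the process by looping over the list of years:

l = df["year"].unique()  # each year once, not once per row
for x in l:
    f = df[df["year"] == x]["field"]
    # Then check whether the fields in f are in the Stata file.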

Here you can find a detailed explanation of how to filter fields using Pandas.

Compare the Stata fields with the list you have

You can use the built-in all() function.

all(item in d.columns for item in f)

If the result is True, every field in the list is a column of the Stata file.
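
Since the question asks for one flag per field rather than one answer per year, a per-field check may be more useful than all(). A minimal sketch with Series.isin follows; the small in-memory frames are hypothetical stand-ins for the CSV and for the 1993 Stata file.

```python
import pandas as pd

# Hypothetical stand-ins for the question's data: `df` lists the fields,
# `d` plays the role of the 1993 Stata file already loaded with read_stata.
df = pd.DataFrame({
    "year": [1993] * 5,
    "description": ["bar0", "bar1", "bar2", "bar3", "bar4"],
    "field": ["a01arb92", "a01svb92", "a01fam92", "a08", "a01bea93"],
})
d = pd.DataFrame({
    "id": [0, 0],
    "a01arb92": [1, 1],
    "a01svb92": [1, 1],
    "a08": [1, 1],
    "a01bea93": [1, 2],
})

fields_1993 = df.loc[df["year"] == 1993, "field"]

# One flag per field: True where the field name is a column of d.
per_field = fields_1993.isin(d.columns)

# A single yes/no answer for the whole year, plus the names that are missing.
all_present = per_field.all()
missing = fields_1993[~per_field].tolist()
```

Here isin does the membership test for the whole column at once, so no explicit loop over fields is needed.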

Putting it all together:

l = df["year"].unique()  # List of years
IsInDic = {}  # Dictionary mapping year -> all fields found in the Stata file, e.g. {1993: True}
for x in l:
    f = df[df["year"] == x]["field"]
    # Then check whether f is in the Stata file (d must be the file loaded for year x).
    isInList = all(item in d.columns for item in f)
    IsInDic[x] = isInList  # Store the result per year so you can look it up later.

Update

def isInList(year):
    # True if every field listed for `year` is a column of the Stata DataFrame d
    return all(field in d.columns for field in df[df["year"] == year]["field"])
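
To get the exists column from the question's expected output, the per-year check can be written back into the original frame. The sketch below is one possible way, not necessarily what the answer's author had in mind. It assumes df is indexed by (year, description) as in the question, and it uses a hypothetical dictionary columns_by_year in place of the real Stata files; the 1994 rows are invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in: year -> column names found in that year's Stata file.
# In practice each list would come from the corresponding <year>.dta file.
columns_by_year = {
    1993: ["id", "a01arb92", "a01svb92", "a08", "a01bea93"],
    1994: ["id", "a08"],
}

index = pd.MultiIndex.from_tuples(
    [(1993, "bar0"), (1993, "bar1"), (1993, "bar2"), (1993, "bar3"),
     (1993, "bar4"), (1994, "bar0"), (1994, "bar1")],
    names=["year", "description"],
)
df = pd.DataFrame(
    {"field": ["a01arb92", "a01svb92", "a01fam92", "a08", "a01bea93",
               "a08", "a02xyz94"]},
    index=index,
)

df["exists"] = 0
years = df.index.get_level_values("year")

# One pass per year (not per field): isin tests all of that year's fields at once.
for year in years.unique():
    mask = years == year
    found = df.loc[mask, "field"].isin(columns_by_year.get(year, []))
    df.loc[mask, "exists"] = found.astype(int).to_numpy()
```

A year without an entry in columns_by_year leaves all of its fields flagged 0.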

Answer 2

Here is an approach that relies on the fact that pandas automatically fills entries for missing index labels with NaN.

First prepare the data. You have probably already done this step.

df1 = pd.read_csv(r'c:\temp\test1.txt', sep=' ')

df1
Out[30]: 
   year description     field
0  1993        bar0  a01arb92
1  1993        bar1  a01svb92
2  1993        bar2  a01fam92
3  1993        bar3       a08
4  1993        bar4  a01bea93

df1 = df1.set_index(['year', 'description', 'field'])

df2 = pd.read_csv(r'c:\temp\test2.txt', sep=' ')

df2
Out[33]: 
   year description     field
0  1993        bar0  a01arb92
1  1993        bar1  a01svb92
2  1993        bar3       a08
3  1993        bar4  a01bea93

df2 = df2.set_index(['year', 'description', 'field'])

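Next, create a new column in df2 and let pandas copy that column over to the first DataFrame. The assignment aligns on the index, so entries missing from df2 are filled with NaN. Then use fillna to replace them with 0.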

df2['exists'] = 1

df1['exists'] = df2['exists']

df1
Out[37]: 
                           exists
year description field           
1993 bar0        a01arb92       1
     bar1        a01svb92       1
     bar2        a01fam92     NaN
     bar3        a08            1
     bar4        a01bea93       1

df1.fillna(0)
Out[38]: 
                           exists
year description field           
1993 bar0        a01arb92       1
     bar1        a01svb92       1
     bar2        a01fam92       0
     bar3        a08            1
     bar4        a01bea93       1
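
The same index alignment can be done in one step with reindex, which keeps the flags as integers instead of going through NaN. This is a small sketch on in-memory indexes that mirror the two text files above; the names listed and present are made up for illustration. Note also that df1.fillna(0) returns a new DataFrame, so the result has to be assigned back to keep it.

```python
import pandas as pd

# Fields listed per year (mirrors df1) and fields actually present (mirrors df2).
listed = pd.MultiIndex.from_tuples(
    [(1993, "a01arb92"), (1993, "a01svb92"), (1993, "a01fam92"),
     (1993, "a08"), (1993, "a01bea93")],
    names=["year", "field"],
)
present = pd.MultiIndex.from_tuples(
    [(1993, "a01arb92"), (1993, "a01svb92"), (1993, "a08"), (1993, "a01bea93")],
    names=["year", "field"],
)

# Start with 1 for every present field, then align to the full list;
# labels missing from `present` receive fill_value instead of NaN.
exists = pd.Series(1, index=present).reindex(listed, fill_value=0)
```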

Check whether column names exist
