Question

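I'm new to Python; most of my work is done in R. I'd like to know how to solve this problem in Python. See the linked question for a clear statement of the problem and the R solution code: How to calculate a table of pairwise counts from long-form data frame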

Here is the dataset:

id  featureCode
5   PPLC
5   PCLI
6   PPLC
6   PCLI
7   PPL
7   PPLC
7   PCLI
8   PPLC
9   PPLC
10  PPLC

This is what I want:

     PPLC  PCLI  PPL
PPLC  0     3     1
PCLI  3     0     1
PPL   1     1     0

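I want to count how many times each feature code is used together with each of the other feature codes (the "pairwise counts" of the title). I hope that makes sense now. Please help. Thanks.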

Answer 1

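This can be set up with dictionaries and analysed with sets and a Counter. However, I'll show the analysis with the simplest dictionary-and-loop approach. The actual code can of course be made shorter; I'm deliberately showing the expanded version. My Python installation doesn't have Pandas available, so I'm using plain Python.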

# Assume you have a list of (id, featureCode) tuples called lst
lst.sort()  # sort the list by id
mydict = {}
current_id = None  # renamed from "id" to avoid shadowing the built-in
tags = []
for ids in lst:
  if ids[0] == current_id:
    # Same id as the previous row: pick up the current entry
    tags.append(ids[1])
  else:
    # This is a new id
    # Count the pairs among the previous id's tags.
    for elem1 in tags:
      for elem2 in tags:
        if elem1 != elem2:
          if elem1 not in mydict:
            mydict[elem1] = {}
          if elem2 not in mydict[elem1]:
            mydict[elem1][elem2] = 0
          mydict[elem1][elem2] += 1
    # This is a different id, reset the indicators for the next loop
    current_id = ids[0]
    tags = [ids[1]]  # start a new list of tags for the new id
else:
  # The for loop's else clause runs once the loop has finished.
  # The tags of the last id in lst have to be processed as well.
  for elem1 in tags:
    for elem2 in tags:
      if elem1 != elem2:
        if elem1 not in mydict:
          mydict[elem1] = {}
        if elem2 not in mydict[elem1]:
          mydict[elem1][elem2] = 0
        mydict[elem1][elem2] += 1

# At this point, mydict holds the full pairwise count
for tag in mydict.keys():
  print(tag, mydict[tag])

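Now that you have the counts for each tag, you can format the output by looping over the final dictionary and printing the keys and counts as appropriate.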
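The opening of this answer mentions that sets and a Counter would make the code shorter. A minimal sketch of that idea, not the answer's original code, might look like the following. It assumes the data is already a list of (id, featureCode) tuples. The names tags_by_id, pair_counts and codes are made up for illustration, and the last few lines show one possible way to print the counts as a table.

```python
from collections import Counter
from itertools import permutations

# The question's data as (id, featureCode) tuples.
lst = [(5, "PPLC"), (5, "PCLI"), (6, "PPLC"), (6, "PCLI"),
       (7, "PPL"), (7, "PPLC"), (7, "PCLI"),
       (8, "PPLC"), (9, "PPLC"), (10, "PPLC")]

# Collect the distinct feature codes seen for each id.
tags_by_id = {}
for row_id, code in lst:
    tags_by_id.setdefault(row_id, set()).add(code)

# Count every ordered pair of distinct codes that share an id.
pair_counts = Counter()
for tags in tags_by_id.values():
    pair_counts.update(permutations(sorted(tags), 2))

# Print a square table; a Counter returns 0 for pairs it never saw.
codes = ["PPLC", "PCLI", "PPL"]
print("".join(f"{c:>6}" for c in [""] + codes))
for row in codes:
    print(f"{row:>6}" + "".join(f"{pair_counts[(row, col)]:>6}" for col in codes))
```

Because each id's codes are stored in a set, a code that appears twice under the same id is counted once here. That is a design choice of this sketch and may differ from the loop version above if the data contains repeated rows.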

Answer 2

Here is one way to do this in Pandas, which uses DataFrames similar to R's. I'm assuming you have a DataFrame df containing your data. (You can read the data from a file with pandas.read_table. See: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.parsers.read_table.html)
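If you want to follow along without a data file, one option is to type the question's data in by hand. This is only a sketch for experimenting: it assumes the column names id and featureCode from the question, and it assumes the default integer row index, which the row numbers in the steps that follow rely on.

```python
import pandas

# The question's data typed in by hand; column names follow the question.
df = pandas.DataFrame({
    "id": [5, 5, 6, 6, 7, 7, 7, 8, 9, 10],
    "featureCode": ["PPLC", "PCLI", "PPLC", "PCLI", "PPL",
                    "PPLC", "PCLI", "PPLC", "PPLC", "PPLC"],
})
print(df)
```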

First, use groupby to group the rows by id.

gps = df.groupby("id")
print(gps.groups)
Out: {5: [0, 1], 6: [2, 3], 7: [4, 5, 6], 8: [7], 9: [8], 10: [9]}

groups gives the row numbers that belong to each id.

Next, create the target matrix, whose row and column names are the unique values in featureCode.

import pandas

unqFet = list(set(df["featureCode"]))
# Square matrix of zeros with the feature codes as both row and column labels
final = pandas.DataFrame(0, index=unqFet, columns=unqFet)
print(final)
Out: 
            PCLI PPLC PPL
     PCLI    0    0   0
     PPLC    0    0   0
     PPL     0    0   0

Finally, iterate over your groups and increment the correct cells of the final matrix.

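codes = df["featureCode"]
for g in gps.groups.values():
    # g holds the row labels of one id; visit every ordered pair of distinct rows
    for i in range(len(g)):
        for j in range(len(g)):
            if i != j:
                # .loc updates the cell in place; chained indexing may write to a copy
                final.loc[codes[g[i]], codes[g[j]]] += 1

print(final)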
Out:
          PCLI PPLC PPL
   PCLI    0    3   1
   PPLC    3    0   1
   PPL     1    1   0
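For larger data, the nested loops can be replaced by a cross-tabulation followed by a matrix product. This is a sketch of an alternative technique, not part of the answer above. It assumes each (id, featureCode) combination appears at most once, and the names indicator and pairwise are made up for illustration.

```python
import pandas

df = pandas.DataFrame({
    "id": [5, 5, 6, 6, 7, 7, 7, 8, 9, 10],
    "featureCode": ["PPLC", "PCLI", "PPLC", "PCLI", "PPL",
                    "PPLC", "PCLI", "PPLC", "PPLC", "PPLC"],
})

# One row per id, one column per feature code; each cell counts occurrences.
indicator = pandas.crosstab(df["id"], df["featureCode"])

# (codes x ids) times (ids x codes): cell [a, b] sums count_a * count_b over ids.
pairwise = indicator.T.dot(indicator)

# The diagonal pairs each code with itself, so set it to zero.
for code in pairwise.index:
    pairwise.loc[code, code] = 0

print(pairwise)
```

pandas.crosstab sorts the labels, so the rows and columns come out in alphabetical order, not in the order shown in the question.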

Pairwise frequency count table in Python
