Question

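I'm familiar with R's data holders such as vectors and data frames, but I need to do some text analysis, and Python seems to have good tooling for that. My question is where I can find an explanation of how Python holds data.

Specifically, I have a data set in a tab-separated file, where the text is in the 3rd column and the score I need is in the 4th column.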

id1            id2            text                             score
123            889     "This is the text I need to read..."      88
234            778     "This is the text I need to read..."      78
345            667     "This is the text I need to read..."      91

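In R, I would just load it into a data frame named df1, and when I wanted a column I would use df1$text or df1[,3]. If I wanted a specific cell, I could use df1[1,3].

I am getting a feel for how to read data into Python, but not for how to work with table-like structures.

How would you suggest a Python newbie approach this?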

Answer 1

Look at the DataFrame object in the pandas library.
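As a rough sketch of what that looks like for the file in the question: the example below reads tab-separated text into a DataFrame with `pd.read_csv`. The in-memory string `raw` is a stand-in for the real file, and its contents are assumed from the sample rows in the question.

```python
import io

import pandas as pd

# Stand-in for the tab-separated file from the question (contents assumed).
raw = (
    "id1\tid2\ttext\tscore\n"
    '123\t889\t"This is the text I need to read..."\t88\n'
    '234\t778\t"This is the text I need to read..."\t78\n'
    '345\t667\t"This is the text I need to read..."\t91\n'
)

# sep="\t" tells read_csv that the columns are tab-delimited.
df1 = pd.read_csv(io.StringIO(raw), sep="\t")

print(df1.shape)            # (number of rows, number of columns)
print(list(df1.columns))    # column names taken from the header row
print(df1["score"].mean())  # numeric columns are parsed as numbers
```

With a real file, the `io.StringIO(raw)` argument would be replaced by the file path.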

Answer 2

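Mr. Ullrich's answer using the pandas library is the closest approach to an R data frame. However, you can get very similar functionality with a numpy array, setting the datatype to object if necessary. Newer versions of numpy have field name capabilities similar to data.frame, their indexing is actually more powerful than R's, and their ability to contain objects goes far beyond R's.

I use both R and numpy, depending on the task at hand. R is better with formulas and built-in statistics. Python code is more maintainable and easier to connect to other systems.

Edit: Added a note that numpy now has field name capabilities.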
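To illustrate the field name capability mentioned above, here is a minimal sketch using a numpy structured array. The field names and dtypes are assumptions chosen to match the sample data in the question, not something the answer specifies.

```python
import numpy as np

# Field names and dtypes are illustrative; "U64" is a unicode string
# of up to 64 characters, "i8" a 64-bit integer.
dtype = [("id1", "i8"), ("id2", "i8"), ("text", "U64"), ("score", "i8")]
data = np.array(
    [
        (123, 889, "This is the text I need to read...", 88),
        (234, 778, "This is the text I need to read...", 78),
        (345, 667, "This is the text I need to read...", 91),
    ],
    dtype=dtype,
)

scores = data["score"]            # whole column, selected by field name
first_text = data[0]["text"]      # single cell: row 0, field "text"
high = data[data["score"] > 80]   # rows selected with a boolean mask
```

Unlike a pandas DataFrame, a structured array has a fixed width for string fields, so longer text would be truncated unless the dtype is widened or set to object.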

Answer 3

I'm not sure how well this translates to 'R', which I've never used, but in Python this is how I would approach it:

lines = []
with open('data.txt', 'r') as f:
    for line in f:
        # Split on tabs only, so spaces inside the text column are preserved
        lines.append(line.rstrip('\n').split('\t'))

This reads everything into a Python list of lists. Lists are zero-based. To get the text column from the second line:

print(lines[1][2])

The score for that row:

print(lines[1][3])
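Building on the list-of-lists approach, whole columns can be pulled out with list comprehensions. The sketch below is one way to do it; the in-memory `f` stands in for the opened file, and the names `header`, `records`, `texts` and `scores` are made up for this example.

```python
import io

# Stand-in for open('data.txt'); contents assumed from the question.
f = io.StringIO(
    "id1\tid2\ttext\tscore\n"
    '123\t889\t"This is the text I need to read..."\t88\n'
    '234\t778\t"This is the text I need to read..."\t78\n'
    '345\t667\t"This is the text I need to read..."\t91\n'
)

# Split each line on tabs; the first row is the header.
rows = [line.rstrip("\n").split("\t") for line in f]
header, records = rows[0], rows[1:]

# Look up column positions by name instead of hard-coding 2 and 3.
text_idx = header.index("text")
score_idx = header.index("score")

# Plain splitting keeps the surrounding quotes, so strip them here.
texts = [r[text_idx].strip('"') for r in records]
# Everything read from a text file is a string; convert scores explicitly.
scores = [int(r[score_idx]) for r in records]
```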

Answer 4

Besides pandas' DataFrame, you can also use the rpy2 library (from http://thread.gmane.org/gmane.comp.python.rpy/1344):

import array
import rpy2.robjects as ro

d = dict(x = array.array('i', [1,2]), y = array.array('i', [2,3]))
dataf = ro.r['data.frame'](**d)

Answer 5

One option I've used in the past is csv.DictReader, which lets you refer to the data in each row by name (each row becomes a dict):

import csv
# newline='' lets the csv module handle line endings itself
with open('data.txt', newline='') as f:
    reader = csv.DictReader(f, delimiter='\t')
    for row in reader:
        print(row)

Output (Python 3.8+, where each row is a plain dict in column order):

{'id1': '123', 'id2': '889', 'text': 'This is the text I need to read...', 'score': '88'}
{'id1': '234', 'id2': '778', 'text': 'This is the text I need to read...', 'score': '78'}
{'id1': '345', 'id2': '667', 'text': 'This is the text I need to read...', 'score': '91'}
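Note that every value comes back as a string, so scores need converting before any arithmetic. A small sketch of that follow-up step, with an in-memory string standing in for the file and the variable names chosen only for illustration:

```python
import csv
import io

# Stand-in for the tab-separated file (contents assumed from the question).
raw = (
    "id1\tid2\ttext\tscore\n"
    '123\t889\t"This is the text I need to read..."\t88\n'
    '234\t778\t"This is the text I need to read..."\t78\n'
    '345\t667\t"This is the text I need to read..."\t91\n'
)

reader = csv.DictReader(io.StringIO(raw), delimiter="\t")
rows = list(reader)  # one dict per data row, keyed by the header names

# Convert the score strings to integers to get a numeric column.
scores = [int(row["score"]) for row in rows]

# Pick the row with the highest score, comparing numerically.
best = max(rows, key=lambda row: int(row["score"]))
print(best["id1"], best["score"])
```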

Answer 6

The Python equivalent of R's data frame is pandas.

You initialize a DataFrame as follows:

import pandas as pd
# sep="\t" is needed for a tab-separated file; the default separator is a comma
df = pd.read_csv("filename", sep="\t")

print(df.head())
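For readers coming from R, the indexing idioms from the question map onto pandas roughly as sketched below. The DataFrame is built by hand here so the example is self-contained; its values are taken from the sample rows in the question.

```python
import pandas as pd

df1 = pd.DataFrame({
    "id1": [123, 234, 345],
    "id2": [889, 778, 667],
    "text": ["This is the text I need to read..."] * 3,
    "score": [88, 78, 91],
})

text_by_name = df1["text"]         # R: df1$text
text_by_pos = df1.iloc[:, 2]       # R: df1[,3]  (positions are 0-based in Python)
cell = df1.iloc[0, 2]              # R: df1[1,3]
score_cell = df1.loc[0, "score"]   # row label 0, column "score"
```

`iloc` selects by integer position and `loc` selects by label; with the default index the row labels happen to equal the positions.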

Coming from R, what is the Python equivalent of a data frame?
