Question

I'm currently stuck on some basic issues with a small dataset. Here are the first three lines to illustrate the format of the data:

"Sport","Entry","Contest_Date_EST","Place","Points","Winnings_Non_Ticket","Winnings_Ticket","Contest_Entries","Entry_Fee","Prize_Pool","Places_Paid"

"NBA","NBA 3K Crossover #3 [3,000 Guaranteed] (Early Only) (1/15)","2015-03-01 13:00:00",35,283.25,"13.33","0.00",171,"20.00","3,000.00",35

"NBA","NBA 1,500 Layup #4 [1,500 Guaranteed] (Early Only) (1/25)","2015-03-01 13:00:00",148,283.25,"3.00","0.00",862,"2.00","1,500.00",200

The issues I'm having after using read_csv to create a DataFrame:

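The presence of commas in some of the columns (e.g. Prize_Pool) results in Python treating those entries as strings. I need to convert them to floats in order to do certain calculations. I've used Python's replace() function to get rid of the commas, but that's about as far as I've gotten.
The column Contest_Date_EST contains timestamps, but some of them are repeated. I'd like to subset the entire dataset into one that has only unique timestamps. It would be nice to have a choice of which duplicated entry or entries to remove, but for now I'd just like to filter the data down to unique timestamps.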

Answer 1

Use the thousands=',' argument for numbers that contain a comma:

In [1]: from pandas import read_csv
In [2]: d = read_csv('data.csv', thousands=',')

You can check that Prize_Pool is numeric:

In [3]: type(d.loc[0, 'Prize_Pool'])  # .ix has been removed from pandas; .loc does the same label lookup
Out[3]: numpy.float64
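
As a self-contained sketch of the same idea (assuming pandas is installed), the snippet below reads an abridged copy of the question's rows from an in-memory string instead of data.csv. The three-column subset and the variable names are my own choices for brevity. It also shows the after-the-fact conversion with str.replace() for a DataFrame that has already been loaded without thousands=','.

```python
import io

import pandas as pd

# Abridged sample based on the question's rows (only three columns kept).
csv_text = (
    '"Sport","Entry_Fee","Prize_Pool"\n'
    '"NBA","20.00","3,000.00"\n'
    '"NBA","2.00","1,500.00"\n'
)

# Option 1: let the parser handle the thousands separator while reading.
df = pd.read_csv(io.StringIO(csv_text), thousands=',')
print(df.dtypes)

# Option 2: the column was read as strings; strip the commas, then cast.
raw = pd.read_csv(io.StringIO(csv_text))
converted = raw['Prize_Pool'].str.replace(',', '', regex=False).astype(float)
print(converted.sum())
```

Option 1 is usually simpler because it applies to every numeric column at once; option 2 is handy when only one column needs fixing after loading.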

To drop rows, keeping the first observation for each timestamp (the last can be kept instead):

In [7]: d.drop_duplicates('Contest_Date_EST', keep='first')  # keep= replaces the removed take_last= argument
Out[7]:
  Sport                                              Entry  \
0   NBA  NBA 3K Crossover #3 [3,000 Guaranteed] (Early ...

      Contest_Date_EST  Place  Points  Winnings_Non_Ticket  Winnings_Ticket  \
0  2015-03-01 13:00:00     35  283.25                13.33                0

   Contest_Entries  Entry_Fee  Prize_Pool  Places_Paid
0              171         20        3000           35
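
The question also asks about choosing which duplicate to drop. A minimal sketch of the three values accepted by keep, assuming a recent pandas version: the first two rows reuse the timestamp and Place values from the question, while the third row is a hypothetical addition so that one timestamp is unique.

```python
import pandas as pd

df = pd.DataFrame({
    'Contest_Date_EST': [
        '2015-03-01 13:00:00',
        '2015-03-01 13:00:00',
        '2015-03-02 19:00:00',  # hypothetical row, not from the question
    ],
    'Place': [35, 148, 7],
})

# Keep the first row seen for each timestamp.
first = df.drop_duplicates(subset='Contest_Date_EST', keep='first')
# Keep the last row seen for each timestamp.
last = df.drop_duplicates(subset='Contest_Date_EST', keep='last')
# Drop every row whose timestamp occurs more than once.
only_unique = df.drop_duplicates(subset='Contest_Date_EST', keep=False)

print(first['Place'].tolist())        # [35, 7]
print(last['Place'].tolist())         # [148, 7]
print(only_unique['Place'].tolist())  # [7]
```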

Answer 2

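(Edit: Just realized you're using pandas; I should have looked at that. I'll leave this here for now in case it's applicable, but I'm inclined to take it down if peer pressure calls for it.)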

I'll try to update it to use pandas later tonight.

It seems itertools.groupby() is the tool for this job.

Something like this?

import csv
import itertools

class CsvImport():

    def Run(self, filename):
        # Get the normalized rows from the CSV file, grouped by timestamp
        rows = self.readCsv(filename)
        for key, values in rows.items():
            print("\nKey: " + key)
            for i, value in enumerate(values, start=1):
                print("\nValue {index} : {value}".format(index=i, value=value))

    def readCsv(self, fileName):
        # newline='' is the mode the csv module expects in Python 3
        with open(fileName, newline='') as csvfile:
            # skipinitialspace ignores padding right after a delimiter,
            # so headers and values are not read with a leading space
            reader = csv.DictReader(csvfile, skipinitialspace=True)
            # groupby() only groups consecutive rows, so extend an existing group
            # instead of overwriting it if a timestamp reappears later in the file
            groupedRows = {}
            for k, g in itertools.groupby(reader, lambda x: x["Contest_Date_EST"]):
                groupedRows.setdefault(k, []).extend(self.normalizeRow(v) for v in g)
            return groupedRows

    def normalizeRow(self, row):
        # row is a dict keyed by column name; strip the thousands separator and cast
        row["Prize_Pool"] = float(row["Prize_Pool"].replace(',', ''))
        # and so on
        return row


if __name__ == "__main__":
    CsvImport().Run("./Test1.csv")


More info:

https://docs.python.org/2/library/itertools.html
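
One caveat worth illustrating: itertools.groupby() only groups consecutive items that share a key, so rows with the same timestamp that are not adjacent end up in separate groups unless the input is sorted by that key first. A small sketch with hypothetical rows (the second timestamp is made up for the example):

```python
import itertools

# Hypothetical rows; the two 13:00 entries are deliberately not adjacent.
rows = [
    {"Contest_Date_EST": "2015-03-01 13:00:00", "Place": 35},
    {"Contest_Date_EST": "2015-03-02 19:00:00", "Place": 7},
    {"Contest_Date_EST": "2015-03-01 13:00:00", "Place": 148},
]


def by_timestamp(row):
    return row["Contest_Date_EST"]


# Without sorting, the repeated timestamp shows up as two separate groups.
unsorted_groups = [(k, len(list(g))) for k, g in itertools.groupby(rows, by_timestamp)]

# Sorting by the same key first makes equal timestamps adjacent.
sorted_rows = sorted(rows, key=by_timestamp)
sorted_groups = [(k, len(list(g))) for k, g in itertools.groupby(sorted_rows, by_timestamp)]

print(len(unsorted_groups))  # 3
print(sorted_groups)
```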

Hope this helps :)

Python data wrangling issues
