
Question

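This is part of an interview challenge. I couldn't find any resources on creating a data frame in Python without using Pandas or NumPy. I'm curious how to create a DataFrame from a CSV and perform data analysis without those libraries. Any pointers would be helpful.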

Answer 1

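I had a similar requirement and came up with this solution: a function that converts a CSV to JSON (JSON was needed for readability and to make the data easier to query without access to Pandas). If the function's headers argument is True, the first row of the CSV is used as the keys in the JSON; otherwise each value's index is used as its key.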

from csv import reader as csv_reader

def csv_to_json(csv_path: str, headers: bool) -> list:
  '''Convert data from a csv to json'''
  # store json data
  json_data = []
  
  try:
    with open(csv_path, 'r') as file:
      reader = csv_reader(file)
      # set column names using first row
      if headers:
        columns = next(reader)
      
      # convert csv to json
      for row in reader:
        row_data = {}
        for i in range(len(row)):
          # set key names
          if headers:
            row_key = columns[i].lower()
          else: 
            row_key = i
          # set key/value
          row_data[row_key] = row[i]
        # add data to json store 
        json_data.append(row_data)
        
  # error handling
  except Exception as e:
    print(repr(e))
    
  return json_data

Given a csv with the following contents

+------+-------+------+
| Year | Month | Week |
+------+-------+------+
| 2020 |    11 |   11 |
| 2020 |    12 |   12 |
+------+-------+------+

The output with headers is (values stay strings, because csv.reader does not convert types)

[
  {"year": "2020", "month": "11", "week": "11"},
  {"year": "2020", "month": "12", "week": "12"}
]

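The output without headers, assuming the file itself has no header row, is (keys are the integer column indexes, values are strings)

[
  {0: "2020", 1: "11", 2: "11"},
  {0: "2020", 1: "12", 2: "12"}
]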
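
Because csv.reader yields every field as a string, numeric analysis needs a conversion step first. The following is a minimal sketch of one way to do that, assuming a hypothetical helper named convert_value (not part of the answer above) that tries int, then float, and otherwise leaves the string untouched:

```python
import json


def convert_value(text):
    """Hypothetical helper: return an int or float when text parses as one, else the original string."""
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            continue
    return text


def convert_rows(rows):
    """Apply convert_value to every value of every row dict."""
    return [{key: convert_value(value) for key, value in row.items()} for row in rows]


# Rows in the shape produced by csv_to_json(..., headers=True)
rows = [
    {"year": "2020", "month": "11", "week": "11"},
    {"year": "2020", "month": "12", "week": "12"},
]
typed_rows = convert_rows(rows)
print(json.dumps(typed_rows))
```

After this step the values can be summed, compared, or sorted numerically rather than as text.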

Answer 2

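When the production environment is memory-constrained, being able to read and manage data without importing extra libraries can be helpful.

The built-in csv module does the job.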

import csv

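There are at least two ways to do this: using csv.reader() or using csv.DictReader().

csv.reader() lets you access CSV data by index and is well suited to simple CSV files (Source).

csv.DictReader(), on the other hand, is friendlier and easier to use, especially when working with large CSV files (Source).

Here is how to use csv.reader()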

>>> import csv
>>> with open('eggs.csv', newline='') as csvfile:
...     spamreader = csv.reader(csvfile, delimiter=' ', quotechar='|')
...     for row in spamreader:
...         print(', '.join(row))
Spam, Spam, Spam, Spam, Spam, Baked Beans
Spam, Lovely Spam, Wonderful Spam

Here is how to use csv.DictReader()

>>> import csv
>>> with open('names.csv', newline='') as csvfile:
...     reader = csv.DictReader(csvfile)
...     for row in reader:
...         print(row['first_name'], row['last_name'])
...
Eric Idle
John Cleese

>>> print(row)
{'first_name': 'John', 'last_name': 'Cleese'}

For another example, check Real Python's page here.
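
To go from reading rows to basic analysis, the rows from csv.DictReader can be aggregated with plain Python. This is a minimal sketch, assuming made-up in-memory data with a hypothetical numeric column named score and a hypothetical function named column_summary:

```python
import csv
import io

# Hypothetical sample data; io.StringIO stands in for an opened CSV file.
sample = io.StringIO(
    "name,score\n"
    "alice,3\n"
    "bob,5\n"
)


def column_summary(file_obj, column):
    """Return count, min, max and mean of one numeric column."""
    values = [float(row[column]) for row in csv.DictReader(file_obj)]
    if not values:
        raise ValueError("no data rows")
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": sum(values) / len(values),
    }


summary = column_summary(sample, "score")
print(summary)
```

The column is converted with float() because DictReader, like csv.reader, returns every field as a string.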

Answer 3

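You will most likely need a library to read the CSV file. Although you could open and parse the data yourself, that would be tedious and time-consuming. Fortunately, Python ships with a standard csv module, so there is nothing to pip install! You can read the file like this: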

import csv

with open('file.csv', 'r') as file:
    my_reader = csv.reader(file, delimiter=',')
    for row in my_reader:
        print(row)

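This shows you that each row is read in as a list. You can then process it by index! There are other ways to read the data, as described at https://docs.python.org/3/library/csv.html, one of which produces dictionaries instead of lists!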
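
Processing by index can be taken one step closer to a data frame by pivoting the row lists into named columns. This is a minimal sketch, assuming the first row is a header and using a hypothetical function name rows_to_columns with made-up sample data:

```python
import csv
import io


def rows_to_columns(file_obj):
    """Read CSV rows and return a dict mapping each header name to a list of column values."""
    reader = csv.reader(file_obj, delimiter=',')
    header = next(reader)  # assumes the first row holds the column names
    columns = {name: [] for name in header}
    for row in reader:
        # Pair each value with its column name by position
        for name, value in zip(header, row):
            columns[name].append(value)
    return columns


# io.StringIO stands in for open('file.csv', newline=''); the data is illustrative.
sample = io.StringIO("id,name\n1,apple\n2,pear\n")
frame = rows_to_columns(sample)
print(frame["name"])
```

A dict of lists makes per-column operations such as counting or filtering straightforward, at the cost of having to keep the lists the same length yourself.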

Update

From the GitHub project you linked, I clipped the following:

product_id,product_name,aisle_id,department_id
9327,Garlic Powder,104,13
17461,Air Chilled Organic Boneless Skinless Chicken Breasts,35,12
17668,Unsweetened Chocolate Almond Breeze Almond Milk,91,16
28985,Michigan Organic Kale,83,4
32665,Organic Ezekiel 49 Bread Cinnamon Raisin,112,3
33120,Organic Egg Whites,86,16
45918,Coconut Butter,19,13
46667,Organic Ginger Root,83,4
46842,Plain Pre-Sliced Bagels,93,3

Save it as file.csv and run it with the code I posted above. Result:

['product_id', 'product_name', 'aisle_id', 'department_id']
['9327', 'Garlic Powder', '104', '13']
['17461', 'Air Chilled Organic Boneless Skinless Chicken Breasts', '35', '12']
['17668', 'Unsweetened Chocolate Almond Breeze Almond Milk', '91', '16']
['28985', 'Michigan Organic Kale', '83', '4']
['32665', 'Organic Ezekiel 49 Bread Cinnamon Raisin', '112', '3']
['33120', 'Organic Egg Whites', '86', '16']
['45918', 'Coconut Butter', '19', '13']
['46667', 'Organic Ginger Root', '83', '4']
['46842', 'Plain Pre-Sliced Bagels', '93', '3']

This meets the requirements in your question. I won't do your project for you; you should be able to work from here.

Answer 4

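I recently ran into a very similar problem, though more complex than this question: building a data structure without using pandas. This is the only relevant question I have found so far. Framed in terms of this question, what I was asked was: use the product ID as the dictionary key, and a list of tuples of aisle and department IDs as the value (in Python). The dictionary is the required data frame. Of course, I couldn't do it in 15 minutes (it took more like 2 hours). It was hard for me to think outside of numpy and pandas.

I have the following solution, which also answers this question as originally asked. It may not be ideal, but it met my needs.
Hope this helps as well.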

import csv

# Read all rows; the with block closes the file when done
with open('data.csv', 'r', newline='') as file:
    items = list(csv.reader(file))  # put the rows of the csv in a list

aisle_dept_id = []  # tuples of (aisle id, department id)
mydict = {}  # product id as keys and a list of the above tuples as values

product_id, aisle_id, department_id, product_name = [], [], [], []

# Skip the header row (index 0).
# Column order assumed: product_id, product_name, aisle_id, department_id
for i in range(1, len(items)):
    product_id.append(items[i][0])
    product_name.append(items[i][1])
    aisle_id.append(items[i][2])
    department_id.append(items[i][3])

for item1, item2 in zip(aisle_id, department_id):
    aisle_dept_id.append((item1, item2))
for item1, item2 in zip(product_id, aisle_dept_id):
    mydict.update({item1: [item2]})

With the output:

mydict:
{'9327': [('104', '13')],
 '17461': [('35', '12')],
 '17668': [('91', '16')],
 '28985': [('83', '4')],
 '32665': [('112', '3')],
 '33120': [('86', '16')],
 '45918': [('19', '13')],
 '46667': [('83', '4')],
 '46842': [('93', '3')]}
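
The same dictionary-of-lists shape also works for grouping, where one key collects several tuples. This is a minimal sketch, assuming a hypothetical follow-up task (not part of the interview question described above) of grouping products by department_id, with a few rows copied from the sample file and a hypothetical function name group_by_department:

```python
import csv
import io

# A few rows copied from the sample file; io.StringIO stands in for the opened file.
sample = io.StringIO(
    "product_id,product_name,aisle_id,department_id\n"
    "9327,Garlic Powder,104,13\n"
    "28985,Michigan Organic Kale,83,4\n"
    "45918,Coconut Butter,19,13\n"
    "46667,Organic Ginger Root,83,4\n"
)


def group_by_department(file_obj):
    """Map each department_id to a list of (product_id, aisle_id) tuples."""
    groups = {}
    for row in csv.DictReader(file_obj):
        # setdefault creates the empty list the first time a department is seen
        groups.setdefault(row["department_id"], []).append(
            (row["product_id"], row["aisle_id"])
        )
    return groups


by_department = group_by_department(sample)
print(by_department)
```

Using setdefault keeps earlier entries for a key, whereas dict.update with a fresh one-element list would overwrite them if a key appeared twice.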

Read a CSV file and perform data analysis without using any libraries (such as Numpy and Pandas)?
