Trying to read a CSV file in Python and create a separate table

Time: 2019-05-04 20:50:32

Tags: python jupyter-notebook

import numpy as np
import pandas as pd

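I am trying to read a CSV file with pandas. This is the data I scraped. Note that each line begins and ends with square brackets [] (maybe it is a list). How should I write the code so that all of the data appears as a table? I don't know how to separate the brackets from the data.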

[]
['Auburn University (Online Master of Business Administration with concentration in Business Analytics)', ' Masters ', ' US', ' AL', ' /Campus ', ' Raymond J. Harbert College of Business ']
['Auburn University (Data Science)', ' Bachelors ', ' US', ' AL', ' /Campus ', ' Business ']
['The University of Alabama (Master of Science in Marketing, Specialization in Marketing Analytics)', ' Masters ', ' US', ' AL', ' Online/ ', ' Manderson Graduate School of Business ']
['The University of Alabama (MS in Operations Management - Decision Analytics Track)', ' Masters ', ' US', ' AL', ' /Campus ', ' Manderson Graduate School of Business ']
['The University of Alabama (M.S. degree in Applied Statistics, Data Mining Track)', ' Masters ', ' US', ' AL', ' /Campus ', ' Manderson Graduate School of Business ']
['The University of Alabama (MBA with concentration in Business Analytics)', ' Masters ', ' US', ' AL', ' Online/ ', ' Culverhouse College of Commerce ']
['Arkansas Tech University (Business Data Analytics)', ' Bachelors ', ' US', ' AR', ' /Campus ', ' Business ']
['University of Arkansas (Graduate Certificate in Business Analytics)', ' Certificate ', ' US', ' AR', ' Online/ ', ' Sam M. Walton College of Business ']
['University of Arkansas (Master of Information Systems with Business Analytics Concentration)', ' Masters ', ' US', ' AR', ' /Campus ', ' Sam M. Walton College of Business ']
['University of Arkansas (Professional Master of Information Systems)', ' Masters ', ' US', ' AR', ' /Campus ', ' Sam M. Walton College of 

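How should I read this CSV file? I want all of the data in table form. Please help.

2 answers:

Answer 0 (score: 2)

Your problem is exactly what the error message tells you. The error occurs while parsing this line: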

  

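['The University of Alabama (Master of Science in Marketing, Specialization in Marketing Analytics)', ' Masters ', ' US', ' AL', ' Online/ ', ' Manderson Graduate School of Business ']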

The code ignores the quote characters and splits the line into fields wherever it finds the delimiter ",". You want this to be a single field:

  

The University of Alabama (Master of Science in Marketing, Specialization in Marketing Analytics)

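However, this "field" contains an instance of the delimiter ",", and the CSV parser acts on it because it ignores the fact that the value is enclosed in quotes. As a result, this part of the data is split into two fields: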

  

['The University of Alabama (Master of Science in Marketing

 Specialization in Marketing Analytics)'

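This causes the line to be split into 7 fields, while your code expects only 6.

Also note that your values will contain the quote characters, which is probably not what you expect either, and the square brackets do not belong in the data. In short, this is not a well-formed CSV file.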
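To make the field count concrete, here is a minimal sketch using one row copied from the question. It first splits on every comma, then parses the same row with the standard csv module after stripping the brackets. Treating the single quote as the quote character and skipping the space after each delimiter are assumptions about this particular data, not something the question's code does, and a value containing an apostrophe would still break this approach.

```python
import csv

# One row copied from the question's data.
line = ("['The University of Alabama (Master of Science in Marketing, "
        "Specialization in Marketing Analytics)', ' Masters ', ' US', ' AL', "
        "' Online/ ', ' Manderson Graduate School of Business ']")

# Splitting on every comma ignores the quotes, so the first value is cut in two.
naive_fields = line.split(",")

# Assumption: drop the surrounding brackets, treat ' as the quote character,
# and skip the space that follows each delimiter.
inner = line.strip()[1:-1]
quoted_fields = next(csv.reader([inner], quotechar="'", skipinitialspace=True))

print(len(naive_fields), len(quoted_fields))  # prints: 7 6
```

The quoted parse keeps the program name in one piece, but the remaining values still carry their leading and trailing spaces and would need to be stripped.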

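Update: I'm a regular-expression junkie. I use regexes for everything and couldn't pass up a challenge like this. Here is a regex-based solution that reads exactly what you want from this data. If you want it to recognize the last line of the data, add "']" to the end of that line.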

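import regex  # third-party "regex" package, not the built-in "re"; provides m.captures()
from pprint import pprint

def parse_file(file):
    # One bracketed row: an optional first quoted field, then any number of
    # comma-separated quoted fields. Group 4 matches once per later field.
    linepat = regex.compile(r"\[\s*('([^']*)')?(\s*,\s*'([^']*)')*\s*\]")
    rows = []
    with open(file) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            m = linepat.match(line)
            # Rows with no fields after the first one, such as "[]", are skipped.
            if m and m.captures(4):
                fields = [m.group(2)] + [s.strip() for s in m.captures(4)]
                rows.append(fields)
    return rows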

def main():
    r = parse_file("/tmp/blah.csv")
    pprint(r)

main()

Result:

[['Auburn University (Online Master of Business Administration with '
  'concentration in Business Analytics)',
  'Masters',
  'US',
  'AL',
  '/Campus',
  'Raymond J. Harbert College of Business'],
 ...
 ['University of Arkansas (Professional Master of Information Systems)',
  'Masters',
  'US',
  'AR',
  '/Campus',
  'Sam M. Walton College of']]

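Note that this does not use the built-in "re" module. That module keeps only the last match of a repeated group, whereas this kind of problem needs every match. Also note that this does not involve Pandas. I know nothing about that module, but if that is what you really want, I assume that feeding the clean, parsed data from this code into Pandas is trivial.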
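As a hedged sketch of that last step, and of an alternative that avoids regular expressions: each line of the data looks like a Python list literal, so the standard library's ast.literal_eval can parse it directly. This swaps in a different technique from the regex above. The column names are hypothetical, since the question does not name them, and to_dataframe assumes pandas is installed.

```python
import ast

# Hypothetical column names; the question does not name the columns.
COLUMNS = ["program", "degree", "country", "state", "delivery", "school"]

def parse_lines(lines):
    """Turn lines that look like Python list literals into cleaned rows."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            fields = ast.literal_eval(line)
        except (ValueError, SyntaxError):
            continue  # skip malformed lines, e.g. a truncated final row
        # Keep only complete rows; this also drops the empty "[]" line.
        if isinstance(fields, list) and len(fields) == len(COLUMNS):
            rows.append([str(f).strip() for f in fields])
    return rows

def to_dataframe(rows):
    import pandas as pd  # imported here so parse_lines works without pandas
    return pd.DataFrame(rows, columns=COLUMNS)

sample = [
    "[]",
    "['Auburn University (Data Science)', ' Bachelors ', ' US', ' AL', ' /Campus ', ' Business ']",
    "['University of Arkansas (Professional Master of Information Systems)', ' Masters ', ' US",
]
rows = parse_lines(sample)
print(rows)
```

For a real file, something like parse_lines(open(path)) followed by to_dataframe(rows) should give a six-column table; the list of lists returned by the regex-based parse_file could be passed to to_dataframe in the same way.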

Answer 1 (score: -1)

A basic way to read file.csv.

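def process(string):
  # Placeholder for whatever per-line handling is needed.
  print("Processing:", string)

data = []
with open("file.csv") as f:
  for line in f:
    line = line.rstrip("\n")  # drop the trailing newline
    process(line)
    data.append(line)  # collect the cleaned line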