import numpy as np
import pandas as pd
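I'm trying to read a CSV file with pandas. This is the data I scraped. Note that each line starts and ends with brackets [] (maybe they are lists). What should I write so that all of the data shows up as a table? I don't know how to separate the brackets from the data.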
[]
['Auburn University (Online Master of Business Administration with concentration in Business Analytics)', ' Masters ', ' US', ' AL', ' /Campus ', ' Raymond J. Harbert College of Business ']
['Auburn University (Data Science)', ' Bachelors ', ' US', ' AL', ' /Campus ', ' Business ']
['The University of Alabama (Master of Science in Marketing, Specialization in Marketing Analytics)', ' Masters ', ' US', ' AL', ' Online/ ', ' Manderson Graduate School of Business ']
['The University of Alabama (MS in Operations Management - Decision Analytics Track)', ' Masters ', ' US', ' AL', ' /Campus ', ' Manderson Graduate School of Business ']
['The University of Alabama (M.S. degree in Applied Statistics, Data Mining Track)', ' Masters ', ' US', ' AL', ' /Campus ', ' Manderson Graduate School of Business ']
['The University of Alabama (MBA with concentration in Business Analytics)', ' Masters ', ' US', ' AL', ' Online/ ', ' Culverhouse College of Commerce ']
['Arkansas Tech University (Business Data Analytics)', ' Bachelors ', ' US', ' AR', ' /Campus ', ' Business ']
['University of Arkansas (Graduate Certificate in Business Analytics)', ' Certificate ', ' US', ' AR', ' Online/ ', ' Sam M. Walton College of Business ']
['University of Arkansas (Master of Information Systems with Business Analytics Concentration)', ' Masters ', ' US', ' AR', ' /Campus ', ' Sam M. Walton College of Business ']
['University of Arkansas (Professional Master of Information Systems)', ' Masters ', ' US', ' AR', ' /Campus ', ' Sam M. Walton College of
How should I read the CSV file? I want all of the data in table form. Please help.
Answer 0 (score: 2)
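Your problem is exactly what the error message is telling you. The error occurs while parsing this line:
['The University of Alabama (Master of Science in Marketing, Specialization in Marketing Analytics)', ' Masters ', ' US', ' AL', ' Online/ ', ' Manderson Graduate School of Business ']
The code ignores the quote characters and splits the line into fields wherever it finds the delimiter ",". You want this to be a single field:
The University of Alabama (Master of Science in Marketing, Specialization in Marketing Analytics)
However, this "field" contains an instance of the delimiter ",", and the CSV parser acts on it because it ignores the fact that the value is wrapped in quotes. So this part of the data is split into two fields:
['The University of Alabama (Master of Science in Marketing
and
 Specialization in Marketing Analytics)'
This causes the line to be split into 7 fields, while your code expects only 6.
Note also that your values will contain the quote characters, which is probably not what you expect either, and the square brackets do not belong in the data at all. In short, this is not a well-formed CSV file.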
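To make the field count concrete, here is a minimal sketch (not part of the original answer) that first splits the offending line naively on commas and then parses it with the standard `csv` module, told to treat `'` as the quote character. The helper name `parse_bracketed_line` is hypothetical, and the sketch assumes every line is wrapped in `[` and `]` and that no value contains an apostrophe.

```python
import csv

LINE = ("['The University of Alabama (Master of Science in Marketing, "
        "Specialization in Marketing Analytics)', ' Masters ', ' US', ' AL', "
        "' Online/ ', ' Manderson Graduate School of Business ']")

def parse_bracketed_line(line):
    """Hypothetical helper: drop the outer brackets, then let csv honor the single quotes."""
    inner = line.strip()
    if inner.startswith("[") and inner.endswith("]"):
        inner = inner[1:-1]
    if not inner:
        return []  # the bare "[]" line carries no fields
    # skipinitialspace lets the parser see the quote that follows ", "
    reader = csv.reader([inner], quotechar="'", skipinitialspace=True)
    return [field.strip() for field in next(reader)]

naive_fields = LINE.split(",")             # ignores quoting entirely
quoted_fields = parse_bracketed_line(LINE)  # respects the quoted first field

print(len(naive_fields))
print(len(quoted_fields))
print(quoted_fields[0])
```

The naive split cuts the program name at its internal comma, while the quote-aware parse keeps it whole and also removes the quote characters and padding.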
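Update: I'm a regex junkie. I use regular expressions for everything and can't pass up a challenge like this. Here is a regex-based solution that reads exactly what you want from this data. If you want it to recognize the last line of the data, add "']" to the end of that line.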
import regex
from pprint import pprint

def parse_file(file):
    # One bracketed, comma-separated list of single-quoted strings per line
    linepat = regex.compile(r"\[\s*('([^']*)')?(\s*,\s*'([^']*)')*\s*\]")
    with open(file) as f:
        r = []
        while True:
            line = f.readline()
            if not line:
                break
            line = line.strip()
            if len(line) == 0:
                continue
            m = linepat.match(line)
            # captures(4) returns every match of the repeated group;
            # lines such as "[]" have none and are skipped
            if m and m.captures(4):
                fields = [m.group(2)] + [s.strip() for s in m.captures(4)]
                r.append(fields)
    return r

def main():
    r = parse_file("/tmp/blah.csv")
    pprint(r)

main()
Result:
[['Auburn University (Online Master of Business Administration with '
  'concentration in Business Analytics)',
  'Masters',
  'US',
  'AL',
  '/Campus',
  'Raymond J. Harbert College of Business'],
 ...
 ['University of Arkansas (Professional Master of Information Systems)',
  'Masters',
  'US',
  'AR',
  '/Campus',
  'Sam M. Walton College of']]
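Note that this does not use the built-in "re" module. That module does not keep every match of a repeated group (only the last one), and access to all of them is a must for this kind of problem. Also note that this does not involve Pandas. I know nothing about that module; if that is what you really want, I assume it would be trivial to feed the clean, parsed data from this code into Pandas.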
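If a standard-library route into Pandas is preferred, the following is a rough sketch rather than the method used above: it swaps the `regex` captures for `re.findall`, which returns every single-quoted string on a line, and hands the rows to `pandas.DataFrame`. The column names and the helper names `parse_lines` and `to_dataframe` are assumptions made for illustration, since the scraped data has no header row. Like the pattern above, it assumes no value contains an apostrophe.

```python
import re

# Assumed column names; the scraped data itself has no header.
COLUMNS = ["program", "degree", "country", "state", "delivery", "school"]

QUOTED = re.compile(r"'([^']*)'")  # one single-quoted value

def parse_lines(lines):
    """Return one list of stripped fields per line holding exactly len(COLUMNS) quoted values."""
    rows = []
    for line in lines:
        fields = [s.strip() for s in QUOTED.findall(line)]
        if len(fields) == len(COLUMNS):  # drops "[]" and truncated lines
            rows.append(fields)
    return rows

def to_dataframe(rows):
    """Build a table from parsed rows (requires pandas)."""
    import pandas as pd  # imported here so parse_lines works without pandas installed
    return pd.DataFrame(rows, columns=COLUMNS)

sample = [
    "[]",
    "['Auburn University (Data Science)', ' Bachelors ', ' US', ' AL', ' /Campus ', ' Business ']",
    "['University of Arkansas (Professional Master of Information Systems)', ' Masters ', ' US', ' AR', ' /Campus ', ' Sam M. Walton College of",
]
rows = parse_lines(sample)
print(rows)
```

With pandas installed, `to_dataframe(rows)` should give a six-column table. Unlike the regex version, this sketch silently drops the truncated last line instead of keeping it once "']" is appended, so whether that trade-off is acceptable depends on the data.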
Answer 1 (score: -1)
A basic way to read file.csv.
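def process(string):
    print("Processing:", string)

data = []
with open("file.csv") as f:
    for line in f:
        # Drop the trailing newline before handling the line
        line = line.replace("\n", "")
        process(line)
        # Keep the raw line; real per-line handling would go here
        data.append(line)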