我想删除CSV文件字段数据中的新行字符。 SO /其他地方的多个人也会问同样的问题。但是,提供的解决方案是脚本编写的。我正在寻找像PYTHON这样的编程语言或Spark中的解决方案(不仅仅是这两个),因为我有很大的文件。
以前就同一主题提出的问题:
我有一个大小约为1GB的CSV文件,想要删除字段数据中的新行字符。 CSV文件的架构动态变化,因此我无法对架构进行硬编码。换行符并不总是出现在逗号之前,即使在字段中也会随机出现。
示例数据:
playerID,yearID,gameNum,gameName,teamName,lgID,GP,startingPos
gomezle01,1933,1,Cricket,Team1,NYA,AL,1
ferreri01,1933,2,Hockey,"This is
Team2",BOS,AL,1
gehrilo01,1933,3,"Game name is
Cricket"
,Team3,NYA,AL,1
gehrich01,1933,4,Hockey,"Here it is
Team4",DET,AL,1
dykesji01,1933,5,"Game name is
Hockey"
,"Team name
Team5",CHA,AL,1
预期输出:
playerID,yearID,gameNum,gameName,teamName,lgID,GP,startingPos
gomezle01,1933,1,Cricket,Team1,NYA,AL,1
ferreri01,1933,2,Hockey,"This is Team2",BOS,AL,1
gehrilo01,1933,3,"Game name is Cricket" ,Team3,NYA,AL,1
gehrich01,1933,4,Hockey,"Here it is Team4",DET,AL,1
dykesji01,1933,5,"Game name is Hockey","Team name Team5",CHA,AL,1
换行符可以包含在任何字段的数据中。
修改 根据代码截图:
答案 0 :(得分:1)
您可以使用re
,pandas
和io
模块,如下所示:
import re
import io
import pandas as pd
with open('data.csv','r') as f:
data = f.read()
df = pd.read_csv(io.StringIO(re.sub('"\s*\n','"',data)))
for col in df.columns: #To replace all line breaks in all textual columns
if df[col].dtype == np.object_:
df[col] = df[col].str.replace('\n','');
In [78]: df
Out[78]:
playerID yearID gameNum gameName teamName lgID GP startingPos
0 gomezle01 1933 1 Cricket Team1 NYA AL 1
1 ferreri01 1933 2 Hockey This is Team2 BOS AL 1
2 gehrilo01 1933 3 Game name is Cricket Team3 NYA AL 1
3 gehrich01 1933 4 Hockey Here it is Team4 DET AL 1
4 dykesji01 1933 5 Game name is Hockey Team name Team5 CHA AL 1
如果您希望此DataFrame
作为输出CSV
文件使用:
df.to_csv('./output.csv')
答案 1 :(得分:0)
这是一个基本的,在通过csv读取之前进行简单的预处理。
import csv
def simple_sanitize(data):
result = []
for i, a in enumerate(data):
if i + 1 != len(data) and data[i + 1][0] == ',':
a = a.replace('\n', '')
result.append(a + data[i + 1])
elif a[0] != ',':
result.append(a)
return result
data = [line for line in open('test.csv', 'r')]
sdata = simple_sanitize(data)
with open('out.csv','w') as f:
for row in sdata:
f.write(row)
result = [list(val.replace('\n', '') for val in line) for line in csv.reader(open('out.csv', 'r'))]
print(result)
结果:
[['playerID', 'yearID', 'gameNum', 'gameName', 'teamName', 'lgID', 'GP', 'startingPos'],
['gomezle01', '1933', '1', 'Cricket', 'Team1', 'NYA', 'AL', '1'],
['ferreri01', '1933', '2', 'Hockey', 'This is Team2', 'BOS', 'AL', '1'],
['gehrilo01', '1933', '3', 'Game name is Cricket ', 'Team3', 'NYA', 'AL', '1'],
['gehrich01', '1933', '4', 'Hockey', 'Here it is Team4', 'DET', 'AL', '1'],
['dykesji01', '1933', '5', 'Game name is Hockey', 'Team name Team5', 'CHA', 'AL', '1']]
答案 2 :(得分:0)
它可以使用一点清洁,但这里有一些代码可以做你想要的。适用于字段内和逗号之前的换行符。如果需要更多要求,可以进行一些调整:
import csv
with open('data.csv', 'r') as csvfile:
reader = csv.reader(csvfile, delimiter=',', quotechar='"')
actual_rows = [next(reader)]
length = len(actual_rows[0])
real_row = []
for row in reader:
if len(row) < length:
if real_row:
real_row[-1] += row[0]
real_row += row[1:]
else:
real_row = row
else:
real_row = row
if len(real_row) == length:
real_row = map(lambda s: s.replace('\n', ' '), real_row)
# store real_row or use it as needed
actual_rows.append(list(real_row))
real_row = []
print(actual_rows)
我将修正后的行存储在actual_rows
中,但如果您不想加载到内存中,只需在评论中指出的每个循环中使用real_row
变量
答案 3 :(得分:0)
此解决方案的基本思想是使用grouper recipe获取固定长度的块(长度等于第一行中的列数)。由于它不会立即读取整个文件,因此不会因大文件而耗尽内存使用量。
$ cat a.py
import csv,itertools as it,operator as op
def grouper(iterable,n):return it.zip_longest(*[iter(iterable)]*n)
with open('in.csv') as inf,open('out.csv','w',newline='') as outf:
r,w=csv.reader(inf),csv.writer(outf)
hdr=next(r)
w.writerow(hdr)
for row in grouper(filter(bool,map(op.methodcaller('replace','\n',''),it.chain.from_iterable(r))),len(hdr)):
w.writerow(row)
$ python3 a.py
$ cat out.csv
playerID,yearID,gameNum,gameName,teamName,lgID,GP,startingPos
gomezle01,1933,1,Cricket,Team1,NYA,AL,1
ferreri01,1933,2,Hockey,This is Team2,BOS,AL,1
gehrilo01,1933,3,Game name is Cricket ,Team3,NYA,AL,1
gehrich01,1933,4,Hockey,Here it is Team4,DET,AL,1
dykesji01,1933,5,Game name is Hockey,Team name Team5,CHA,AL,1
这里假设的一个假设是输入csv中没有空单元格。