I've read through the Pandas documentation, but I think I need a practical example to really understand it.
I have this .TXT file containing all my sql data:
INSERT INTO jos_users VALUES('4065','lel lel','joel', 'chazaa @ frame.com','d0c9f71c7bc8c9','Membre','0','0','2', '2013-01-31 17:15:29','2014-12-10 11:29:13','','{}');
INSERT INTO jos_users VALUES('4066','jame lea','jamal', 'jamal.stan@frame.com','d0c9f71c7774c9','Membre','0','0','2', '2012-11-31 08:15:29','2012-12-10 12:29:13','','{}');
(about 17,000 rows of these), and there are no column names anywhere in my .txt file.
What I'm trying to achieve: get each INSERT statement into a pandas DataFrame as one row, with one column per value.
My code right now, which just prints garbage:
import pandas as pd
import matplotlib.pyplot as plt
pd.set_option('display.mpl_style', 'default')
plt.rcParams['figure.figsize'] = (15, 5)
df = pd.read_csv('2.txt', sep=',', na_values=['g'], error_bad_lines=False)
print df
Answer 0 (score: 0)
OK, here is a sample script I just knocked up. I'm not saying this is the most efficient way to clean up a SQL script; ideally, if you have access to the original database, you should be able to export it as a csv directly.
Anyway, what the following does is open the text file and remove the INSERT INTO statement, the opening and closing parentheses, the quote characters (not strictly necessary, but I prefer that style) and any extraneous whitespace.
In [91]:
with open(r'c:\data\clean.csv', 'wt') as clean:
    with open(r'c:\data\temp sql.txt', 'rt') as f:
        for line in f:
            if len(line) > 0:
                l = line.replace('INSERT INTO jos_users VALUES (', '')
                l = l.replace(", '", ",'")
                l = l.replace("'", '')
                l = l.replace(');', '')
                clean.write(l)
# not strictly necessary, the with blocks above already close both files
clean.close()
f.close()
# read the file back in, there is no header so you need to specify this
df = pd.read_csv(r'c:\data\clean.csv', header=None)
df
Out[91]:
0 1 2 3 4 5 6 \
0 4065 lel lel joel chazaa@frame.com d0c9f71c7bc8c9 Membre 0
1 4066 jame lea jamal jamal.stan@frame.com d0c9f71c7774c9 Membre 0
7 8 9 10 11 12
0 0 2 2013-01-31 17:15:29 2014-12-10 11:29:13 NaN {}
1 0 2 2012-11-31 08:15:29 2012-12-10 12:29:13 NaN {}
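The two timestamp columns (9 and 10 in the output above) come back as plain strings. A minimal follow-up sketch, assuming a reasonably recent pandas, to turn them into real datetimes; errors='coerce' is used because the sample data contains an impossible date ('2012-11-31'), which then becomes NaT instead of raising an error:
import pandas as pd  #hypothetical follow-up to the session above
df[9] = pd.to_datetime(df[9], errors='coerce')    #'2013-01-31 17:15:29' -> Timestamp
df[10] = pd.to_datetime(df[10], errors='coerce')  #'2012-11-31 08:15:29' is invalid -> NaT
print(df.dtypes)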
Answer 1 (score: 0)
EDIT: The approach below is much slower than writing the changed data out to a file and then reading it into a DataFrame with read_csv(). For a 34,000-line file it takes about 23 minutes versus ~3 seconds.
import pandas as pd
import numpy as np
import re
pd.set_option('display.width', 1000)
#Pre-allocate all the space needed by your DataFrame:
df = pd.DataFrame(index=np.arange(18000), columns=np.arange(13))
pattern = r""" #Find all single quoted sequences:
' #Match a single quote, followed by...
( #(start a capture group)
[^']* #not a single quote, 0 or more times, followed by...
) #(end the capture group)
' #a single quote
"""
regex = re.compile(pattern, flags=re.X)
f = open('data.txt')
for i, line in enumerate(f):
    data = re.findall(regex, line)  #findall() returns a list of all the strings that matched the pattern's capture group
    if data:
        df.iloc[i] = data  #insert data at row i
f.close()
print(df)
--output:--
0 1 2 3 4 5 6 7 8 9 10 11 12
0 4065 lel lel joel chazaa@frame.com d0c9f71c7bc8c9 Membre 0 0 2 2013-01-31 17:15:29 2014-12-10 11:29:13 {}
1 4066 jame lea jamal jamal.stan@frame.com d0c9f71c7774c9 Membre 0 0 2 2012-11-31 08:15:29 2012-12-10 12:29:13 {}
...
...
1798 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1799 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
re.findall(pattern, string, flags=0)
Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.
https://docs.python.org/2/library/re.html#re.findall
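As a quick illustration (a minimal sketch using a shortened version of one of the INSERT lines from the question), the single capture group means findall() returns only the text between the quotes:
import re
pattern = r"'([^']*)'"  #compact form of the verbose pattern above
line = "INSERT INTO jos_users VALUES('4066','jame lea','jamal','jamal.stan@frame.com');"
print(re.findall(pattern, line))
#['4066', 'jame lea', 'jamal', 'jamal.stan@frame.com']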
Writing the changed data to a file, then reading it back in with read_csv():
import pandas as pd
import numpy as np
import re
import time
pd.set_option('display.width', 1000)
pattern = r"""
    '       #Match a single quote, followed by...
    (       #start a capture group.
    [^']*   #not a quote, 0 or more times, followed by...
    )       #end capture group.
    '       #a single quote
    """
regex = re.compile(pattern, flags=re.X)
fin = open('data2.txt')  #The two insert statements in the OP, repeated 17,000 times
fout = open('data.csv', 'w')
results = {}
for line in fin:
    data = re.findall(regex, line)
    if data:
        print(*data, file=fout, sep=',')  #write one comma-separated row per INSERT statement
fin.close()
fout.close()
df = pd.read_csv(
    'data.csv',
    sep=',',
    header=None,
    names=np.arange(13),  #column names: 0 - 12
)
print(df)
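The unused import time above suggests the ~23 minutes vs ~3 seconds figures in the edit were taken with a simple wall-clock measurement. A minimal sketch of how such a comparison could be made; the two function names passed in are hypothetical placeholders for the approaches shown above:
import time

def time_it(label, func):
    #run func once and report wall-clock time; func stands in for either
    #the row-by-row df.iloc approach or the write-then-read_csv approach
    start = time.time()
    func()
    print('{}: {:.1f} seconds'.format(label, time.time() - start))

#time_it('iloc approach', load_with_iloc)          #hypothetical wrapper functions
#time_it('read_csv approach', load_with_read_csv)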