Multiple regex string replacements on a large text file using Python

Asked: 2018-01-23 19:37:03

Tags: python regex pandas parsing replace

I have a very large text file on which I want to run multiple regex-based string replacements. Currently I do this with Sublime Text's find-and-replace feature, but on files larger than a GB my system hangs.

These are some of the replacements I am currently running in Sublime:

\\\n - Remove every backslash followed by a newline.

\n - Remove all newlines.

\=\\\" - Replace all instances of =\" with just ="

In one case, I also want to capture a group from the match and reuse it in the replacement text.
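For example, reusing a captured group in Python's re.sub looks like this (a minimal sketch with made-up data, just to illustrate what I mean):

import re

# \1 in the replacement refers to the digits captured by (\d+)
text = 'id=883 id=884'
result = re.sub(r'id=(\d+)', r'id=[\1]', text)
# result == 'id=[883] id=[884]'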

Some experienced people around me suggested writing a quick Python script for this, saying performance would not be an issue.

With my limited Python knowledge, I tried the following:

import pandas as pd
import numpy as np

df = pd.read_csv('story_all.csv')

output = df.str.replace('\n', '')

output.to_csv('story_done.csv', sep='\n', encoding='utf-8')

However, it isn't working, and I suspect I am overcomplicating things.


Note: the fact that the text file is a CSV doesn't really matter; I just need to run some string replacements. The newlines that delimit CSV rows must be preserved while doing so.


The error I am getting is as follows:

Traceback (most recent call last):
  File "replace.py", line 4, in <module>
    df = pd.read_csv('story_all.csv')
  File "/Users/safwan/Envs/regex/lib/python2.7/site-packages/pandas/io/parsers.py", line 709, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/Users/safwan/Envs/regex/lib/python2.7/site-packages/pandas/io/parsers.py", line 455, in _read
    data = parser.read(nrows)
  File "/Users/safwan/Envs/regex/lib/python2.7/site-packages/pandas/io/parsers.py", line 1069, in read
    ret = self._engine.read(nrows)
  File "/Users/safwan/Envs/regex/lib/python2.7/site-packages/pandas/io/parsers.py", line 1839, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 902, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 924, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas/_libs/parsers.pyx", line 978, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 965, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 2208, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 19 fields in line 8058, saw 65

An example of the CSV file's contents:

id,title,name_in_english,type,water_directory_term,org_work_area_term,org_type_term,defined_state,org_location_taluka_term,org_location_state_term,org_location_village_term,org_name_term,ha_free_term,org_location_dist_term,fax,samprak_bekti,email,phoneno,website/blog,postal_address,sangathan_ke_bare_main,rajya_state,taluka_sahar,jilla_district,kisi_prakar_kaa_sangathan,name,ID,created,status
"883","some title","","org","lorem","ipsum","lorem","","","very large body field","","","","","admin","1","1230273749","1"
"884","some title","","org","lorem","ipsum","lorem","","","very large body field","","","","","admin","1","1230273749","1"
"885","some title","","org","lorem","ipsum","lorem","","","very large body field","","","","","admin","1","1230273749","1"
"886","some title","","org","lorem","ipsum","lorem","","","very large body field","","","","","admin","1","1230273749","1"

2 Answers:

Answer 0 (score: 1)

If I understand correctly, you can do something like this. It seems to work on the data sample you shared:

import pandas as pd

df = pd.read_csv('story_all.csv', sep=',')

# Regex patterns to strip from every cell
chars = [
    '\n',
]

# regex=True makes replace() treat the patterns as regular expressions
# and apply them across the whole DataFrame
output = df.replace(chars, '', regex=True)
output.to_csv('story_done.csv', sep=',', encoding='utf-8', index=False)
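If you also need the =\" to =" rewrite, the same call can take a dict mapping patterns to replacements instead of a list. A sketch along the same lines, untested against your real data:

# keys are regex patterns, values are their replacements
output = df.replace(
    {
        r'\\\n': '',    # backslash followed by newline: drop both
        r'\n': '',      # remaining newlines inside fields
        r'=\\"': '="',  # =\" becomes ="
    },
    regex=True,
)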

Answer 1 (score: 0)

I was finally able to accomplish the task without pandas. Although this approach reads the entire file into memory, it works well for files up to 1-1.5 GB on a MacBook Pro, which suits my purpose. I found the basic code for this here.

# import the modules we need (re is for regex)
import os, re

# set the working directory as a shortcut
os.chdir('/Users/username/Code/python/regex')

# open the source file and read all of it into memory
with open('story_all.csv', 'r') as fh:
    thetext = fh.read()

# Create the pattern objects. Note the "r" prefix: it marks the strings
# as raw, so we don't have to escape our escape characters twice.

# match every newline followed by a backslash
p1 = re.compile(r'\n\\')
# match every run of newlines except those followed by a quoted number,
# which marks the start of the next CSV row
p2 = re.compile(r'\n+(?!\"\d+\")')
# match literal \N sequences
p3 = re.compile(r'\\N')
# match =\" so it can be rewritten as ="
p4 = re.compile(r'\=\\\"')

# do the replacements
result = p1.sub('', thetext)
result = p2.sub('', result)
result = p3.sub('', result)
result = p4.sub('="', result)

# write the result
with open('done.csv', 'w') as f_out:
    f_out.write(result)
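To show what the negative lookahead in p2 does, here is a toy example (made-up sample text, same pattern as above):

import re

sample = 'part of a long field\nstill the same field\n"884","some title"'
# Newlines are removed unless the next characters are a quoted number,
# i.e. the start of a new CSV row.
result = re.sub(r'\n+(?!\"\d+\")', '', sample)
# result == 'part of a long fieldstill the same field\n"884","some title"'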

When run on a file close to 1 GB, the full script above takes roughly 30-40 seconds.
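If the files ever grow beyond available RAM, the line-joining part could be done in constant memory instead. This is a rough sketch, not what I actually ran; it assumes (as the sample data suggests) that every new row starts with a quoted number:

import re

# heuristic from the sample data: a new row begins with "<digits>",
record_start = re.compile(r'^"\d+",')

def clean(record):
    # the same per-record substitutions as in the script above
    record = record.replace('\\N', '')
    record = record.replace('=\\"', '="')
    return record

with open('story_all.csv', 'r') as src, open('done_stream.csv', 'w') as dst:
    buf = None
    for line in src:
        line = line.rstrip('\n')
        if buf is None:
            buf = line                    # header line
        elif record_start.match(line):
            dst.write(clean(buf) + '\n')  # previous record is complete
            buf = line
        else:
            # joining lines here plays the role of p1/p2 above;
            # drop the leading backslash that p1 would have removed
            buf += line[1:] if line.startswith('\\') else line
    if buf is not None:
        dst.write(clean(buf) + '\n')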