Search for and split certain strings in a text file and save the output

Date: 2017-12-25 05:59:28

Tags: python string

How can I split certain strings in lines that contain both numeric and alphabetic characters?

The data I have looks like this (tembin-data.dat):

['3317121918', '69N1345E', '15']

['3317122000', '72N1337E', '20']

['3317122006', '75N1330E', '20']

['3317122012', '78N1321E', '20']

['3317122018', '83N1310E', '25']

.......etc

I need to remove the "N" and "E" so that the data is rearranged like this:

['3317121918', '69','1345','15']

['3317122000', '72','1337','20']

['3317122006', '75','1330','20']

['3317122012', '78','1321','20']

['3317122018', '83','1310','25']

.......etc

The Python script I am currently using is:

newfile = open('tembin-data.dat', 'w')
with open('tembin4.dat', 'r') as inF:
    for line in inF:
        myString = '331712'
        if myString in line:
            data = line.split()
            print(data)
            newfile.write("%s\n" % data)
newfile.close()

tembin4.dat looks like this:

REMARKS:

230900Z POSITION NEAR 7.8N 118.6E.

TROPICAL STORM 33W (TEMBIN), LOCATED APPROXIMATELY 769 NM EAST-

SOUTHEAST OF HO CHI MINH CITY, VIETNAM, HAS TRACKED WESTWARD AT

11 KNOTS OVER THE PAST SIX HOURS. MAXIMUM SIGNIFICANT WAVE HEIGHT

AT 230600Z IS 14 FEET. NEXT WARNINGS AT 231500Z, 232100Z, 240300Z

AND 240900Z.//

3317121918  69N1345E  15

3317122000  72N1337E  20

3317122006  75N1330E  20

3317122012  78N1321E  20

3317122018  83N1310E  25

3317122100  86N1295E  35

3317122106  85N1284E  35

3317122112  84N1276E  40

3317122118  79N1267E  50

3317122118  79N1267E  50

3317122200  78N1256E  45

3317122206  78N1236E  45

3317122212  80N1225E  45

3317122218  79N1214E  50

3317122218  79N1214E  50

3317122300  77N1204E  55

3317122300  77N1204E  55

3317122306  77N1193E  55

3317122306  77N1193E  55

NNNN

5 answers:

Answer 0 (score: 2)

Try this:

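import re
with open("tembin4.dat", "r") as f:
    for line in f:
        # split() without arguments also drops the newline and empty strings
        lst = line.split()
        for i, x in enumerate(lst):
            # Match a token such as 69N1345E and capture both numbers
            grp = re.findall(r'(\d+)N(\d+)E', x)
            if grp:
                # Replace the token in place with its two numeric parts
                lst[i:i+1] = grp[0]
        print(" ".join(lst))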

Answer 1 (score: 2)

Just extend your approach with a regular expression and a split.

import re

# Character class matching a single N or E (a "|" inside [] would be a literal pipe)
pat = re.compile("[NE]")

with open('tembin4.dat', 'r') as inF, open('tembin-data.dat', 'w') as newfile:
    for line in inF:
        myString = '331712'
        if myString in line:
            data = line.split()
            data[2:2] = pat.split(data[1])[:-1]  # insert the flattened list at index 2
            del data[1]  # remove the string containing N and E from the list
            print(data)
            newfile.write("%s\n" % data)
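
To see why the `[:-1]` slice is needed, here is a minimal sketch of what splitting a single token on N and E returns. The helper name `split_lat_lon` is hypothetical and only used for illustration; it assumes every token ends with E, as in the sample data.

```python
import re

def split_lat_lon(token):
    """Split a token such as '69N1345E' into its two numeric parts."""
    parts = re.split("[NE]", token)  # ['69', '1345', ''] for '69N1345E'
    return parts[:-1]                # drop the empty string left after the trailing E

print(split_lat_lon("69N1345E"))  # ['69', '1345']
```

Because the token ends with the separator E, the split always produces an empty last element, which is what the slice discards.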

Answer 2 (score: 2)

  

You can use a positive lookahead (?=N) and a positive lookbehind (?<=N) and grab the matches:

import re
pattern = r"'\d+'|(\d+)(?=N)|(?<=N)(\d+)"
with open('file.txt','r') as f:
    for line in f:
        sub_list=[]
        search=re.finditer(pattern,line)
        for lin in search:
            sub_list.append(int(lin.group().strip("'")))

        if sub_list:
            print(sub_list)
  

Output:

[3317121918, 69, 1345, 15]
[3317122000, 72, 1337, 20]
[3317122006, 75, 1330, 20]
[3317122012, 78, 1321, 20]
[3317122018, 83, 1310, 25]
  

Regex details:

'\d+'|(\d+)(?=N)|(?<=N)(\d+)

'\d+' matches one or more digits enclosed in single quotes
\d matches a digit (equivalent to [0-9])
+ Quantifier: matches one or more times, as many times as possible, giving back as needed (greedy)

Positive Lookahead (?=N)
Asserts that the text immediately after the current position matches the regex below
N matches the character N literally (case sensitive)

Positive Lookbehind (?<=N)
Asserts that the text immediately before the current position matches the regex below
N matches the character N literally (case sensitive)
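
The two lookarounds can be tried separately on one token. This is only a small sketch; the variable names `token`, `before_n` and `after_n` are made up for the example.

```python
import re

token = "69N1345E"

# Digits that are immediately followed by N (the N itself is not consumed)
before_n = re.search(r"\d+(?=N)", token).group()

# Digits that are immediately preceded by N
after_n = re.search(r"(?<=N)\d+", token).group()

print(before_n, after_n)  # 69 1345
```

Since lookarounds only assert and never consume characters, neither match contains the letter N.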

Answer 3 (score: 1)

With pandas you can do this easily.

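import pandas as pd
import os # optional

os.chdir('C:\\Users') # optional
# The default comma separator leaves brackets, quotes and spaces in each field
df = pd.read_csv('tembin-data.dat', header = None)

# Extract the digits before N and before E instead of relying on fixed offsets
parts = df[1].str.extract(r'(\d+)N(\d+)E')
df[3]= parts[0]
df[4]= parts[1]

df = df.drop(1, axis = 1)

df.to_csv('tembin-out.dat',header=False)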

Answer 4 (score: 1)

You can try this short solution in Python 3:

import re
s = [['3317121918', '69N1345E', '15'], ['3317122000', '72N1337E', '20'], ['3317122006', '75N1330E', '20'], ['3317122012', '78N1321E', '20'],
['3317122018', '83N1310E', '25']]
new_s = [[a, *re.findall(r'\d+', b), c] for a, b, c in s]

Output:

[['3317121918', '69', '1345', '15'], ['3317122000', '72', '1337', '20'], ['3317122006', '75', '1330', '20'], ['3317122012', '78', '1321', '20'], ['3317122018', '83', '1310', '25']]
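
The snippet above starts from an in-memory list. As a rough sketch of going straight from the raw text lines to the same structure, the following filters and splits in one pass. The names `DATA_LINE` and `extract_rows` are hypothetical, and the pattern assumes every data line has the form of a 10-digit field, a token like 69N1345E, and a final number, as in the sample shown in the question.

```python
import re

# Assumed layout of a data line: <10 digits> <digits>N<digits>E <digits>
DATA_LINE = re.compile(r"(\d{10})\s+(\d+)N(\d+)E\s+(\d+)")

def extract_rows(lines):
    """Yield [field, lat, lon, value] for every line that matches DATA_LINE."""
    for line in lines:
        m = DATA_LINE.fullmatch(line.strip())
        if m:
            yield list(m.groups())

sample = [
    "AND 240900Z.//",
    "3317121918  69N1345E  15",
    "",
    "3317122000  72N1337E  20",
    "NNNN",
]
rows = list(extract_rows(sample))
print(rows)
```

An open file object can be passed in place of `sample`, since iterating over a file yields its lines; each row can then be written to the output file in whatever format is needed. Matching the whole line avoids relying on the '331712' substring test.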