How can I condense the following data, i.e. remove its redundancy:
code: GB-ENG, jobs: 2673
code: GB-ENG, jobs: 23
code: GB-ENG, jobs: 459
code: GB-ENG, jobs: 346
code: RO-B, jobs: 9
code: DE-NW, jobs: 4
code: DE-BW, jobs: 3
code: DE-BY, jobs: 9
code: DE-HH, jobs: 34
code: DE-BY, jobs: 11
code: BE-BRU, jobs: 27
code: GB-ENG, jobs: 20
The output should look like this:
GB-ENG, 3521
RO-B, 9
DE-NW, 4
DE-BW, 3
DE-HH, 34
DE-BY, 20
BE-BRU, 27
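That is, each code appears exactly once in its canonical form (e.g. DE-BY), followed by the sum of the numbers associated with every instance of that code. For example: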
code: DE-BY, jobs: 11
code: DE-BY, jobs: 9
becomes
DE-BY, 20
At the moment I am creating the input with this Python script:
import json
import requests
from collections import defaultdict
from pprint import pprint


def hasNumbers(inputString):
    return any(char.isdigit() for char in inputString)


# open up the output of 'data-processing.py'
with open('job-numbers-by-location.txt') as data_file:
    # print the output to a file
    with open('phase_ii_output.txt', 'w') as output_file_:
        for line in data_file:
            identifier, name, coords, number_of_jobs = line.split("|")
            coords = coords[1:-1]
            lat, lng = coords.split(",")
            # print("lat: " + lat, "lng: " + lng)
            response = requests.get("http://api.geonames.org/countrySubdivisionJSON?lat="+lat+"&lng="+lng+"&username=s.matthew.english").json()
            codes = response.get('codes', [])
            for code in codes:
                if code.get('type') == 'ISO3166-2':
                    country_code = '{}-{}'.format(response.get('countryCode', 'UNKNOWN'), code.get('code', 'UNKNOWN'))
                    if not hasNumbers( country_code ):
                        # print("code: " + country_code + ", jobs: " + number_of_jobs)
                        output_file_.write("code: " + country_code + ", jobs: " + number_of_jobs)
# no explicit close() needed: the with statement closes the file
It would probably be most efficient to include this functionality as part of that script, but I haven't figured out how to do that yet.
Answer 0 (score: 1)
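The following code implements a counter with the dict.get() method, which you already use in your current code. It reads the values from your current .txt file, but you could just as well skip writing that file and reading it back, and apply the same approach directly.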
tally = {}
with open('country_codes.txt', 'r') as infile, open('condensed.txt', 'w') as outfile:
    for line in infile:
        # each line looks like "code: GB-ENG, jobs: 2673"
        tag1, code, tag2, num = line.split()
        code = code.rstrip(',')  # drop the comma that follows the code
        tally[code] = tally.get(code, 0) + int(num)
    for key, value in tally.items():  # Use .iteritems() for Python 2.x
        outfile.write('{}, {}\n'.format(key, value))
This takes a file with the following structure (country_codes.txt):
code: GB-ENG, jobs: 2673
code: GB-ENG, jobs: 23
code: GB-ENG, jobs: 459
code: GB-ENG, jobs: 346
code: RO-B, jobs: 9
code: DE-NW, jobs: 4
code: DE-BW, jobs: 3
code: DE-BY, jobs: 9
code: DE-HH, jobs: 34
code: DE-BY, jobs: 11
code: BE-BRU, jobs: 27
code: GB-ENG, jobs: 20
and writes the following to condensed.txt (the order of the lines may differ depending on the Python version):
DE-BY, 20
DE-HH, 34
DE-BW, 3
DE-NW, 4
RO-B, 9
GB-ENG, 3521
BE-BRU, 27
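The remark about bypassing the intermediate file could look roughly like the sketch below. It is only an illustration under assumptions: `tally_jobs` is a hypothetical helper name, and the sample pairs stand in for the `country_code` and `number_of_jobs` values that the loop in the question's script would produce in place of its `output_file_.write(...)` call.

```python
from collections import defaultdict


def tally_jobs(pairs):
    """Sum the job counts per code; `pairs` yields (code, jobs) tuples."""
    totals = defaultdict(int)
    for code, jobs in pairs:
        # int() tolerates surrounding whitespace such as a trailing newline
        totals[code] += int(jobs)
    return dict(totals)


# Sample pairs (assumed data) standing in for the values from the question's loop.
pairs = [("GB-ENG", "2673"), ("GB-ENG", "23"), ("DE-BY", "9"), ("DE-BY", "11\n")]
totals = tally_jobs(pairs)
for code, jobs in totals.items():
    print("{}, {}".format(code, jobs))
```

In the question's script, the pairs could be collected in a list inside the innermost `if` and passed to the helper once the loop has finished.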
Answer 1 (score: 1)
You can do it like this:
data = """code: GB-ENG, jobs: 2673
code: GB-ENG, jobs: 23
code: GB-ENG, jobs: 459
code: GB-ENG, jobs: 346
code: RO-B, jobs: 9
code: DE-NW, jobs: 4
code: DE-BW, jobs: 3
code: DE-BY, jobs: 9
code: DE-HH, jobs: 34
code: DE-BY, jobs: 11
code: BE-BRU, jobs: 27
code: GB-ENG, jobs: 20"""
final_data = {}
prefix = 'code: '
for line in data.split('\n'):
    # slice off the prefix; str.strip('code: ') would strip a set of characters, not the prefix
    code, count = line[len(prefix):].split(', jobs: ')
    if code in final_data:
        final_data[code]['amount'] += int(count)
    else:
        final_data[code] = {'amount': int(count)}
for key, value in final_data.items():
    print('code: {}, jobs: {}'.format(key, value['amount']))
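The nested `{'amount': ...}` dict is not strictly needed for a plain sum. As an alternative sketch, `collections.Counter` from the standard library can hold the totals directly. The helper names `parse_line` and `condense` are hypothetical, and the input is assumed to follow the `code: X, jobs: N` format exactly.

```python
from collections import Counter


def parse_line(line):
    """Split 'code: GB-ENG, jobs: 23' into ('GB-ENG', 23)."""
    code_part, jobs_part = line.strip().split(", ")
    return code_part.split(": ")[1], int(jobs_part.split(": ")[1])


def condense(lines):
    """Return a Counter mapping each code to its summed job count."""
    totals = Counter()
    for line in lines:
        if line.strip():  # skip blank lines
            code, jobs = parse_line(line)
            totals[code] += jobs
    return totals


sample = ["code: DE-BY, jobs: 9", "code: DE-HH, jobs: 34", "code: DE-BY, jobs: 11"]
condensed = condense(sample)
# most_common() lists the codes with the largest totals first
for code, jobs in condensed.most_common():
    print("{}, {}".format(code, jobs))
```

A `Counter` also answers questions such as "which code has the most jobs" without extra code, which may or may not matter here.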
Answer 2 (score: 1)
import re
import sys
from collections import defaultdict

tally = defaultdict(int)
for line in sys.stdin:
    match = re.match(r'^code: (?P<code>\S+), jobs: (?P<jobs>\d+)', line)
    if match is None:  # skip lines that are not in the expected format
        continue
    tally[match.group('code')] += int(match.group('jobs'))
for code, jobs in tally.items():  # .iteritems() in Python 2
    print("{}, {}".format(code, jobs))
Answer 3 (score: 1)
This assumes your countries.txt has the following format:
code: GB-ENG jobs: 2673
code: GB-ENG jobs: 23
code: GB-ENG jobs: 459
code: GB-ENG jobs: 346
code: RO-B jobs: 9
code: DE-NW jobs: 4
code: DE-BW jobs: 3
code: DE-BY jobs: 9
code: DE-HH jobs: 34
code: DE-BY jobs: 11
code: BE-BRU jobs: 27
code: GB-ENG jobs: 20
Code:
with open('countries.txt') as input_file, open('phase_ii_output.txt', 'w') as output_file:
    args = []
    dic = {}
    for line in input_file:
        args.append(line.split(" "))
    for n in args:
        key = n[1]
        num = int(n[3].rstrip())
        if key in dic:
            dic[key] += num
        else:
            dic[key] = num
    output_file.write(str(dic))  # write() needs a string, not a dict
Output (the order of the keys depends on the Python version):
{'BE-BRU': 27, 'DE-BY': 20, 'DE-NW': 4, 'DE-BW': 3, 'RO-B': 9, 'GB-ENG': 3521, 'DE-HH': 34}
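A dict written this way is not yet in the "CODE, total" layout the question asks for. A small sketch of that last formatting step follows; `format_totals` is a hypothetical helper name, and the sample dict is an excerpt of the totals above.

```python
def format_totals(totals):
    """Return one 'CODE, total' line per entry of the dict."""
    return "".join("{}, {}\n".format(code, jobs) for code, jobs in totals.items())


# Excerpt of the summed totals, used here as sample data.
totals = {"GB-ENG": 3521, "RO-B": 9, "DE-BY": 20}
text = format_totals(totals)
print(text, end="")
```

The returned string can be passed to `output_file.write()` as is.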
Answer 4 (score: 1)
Assuming the text is stored in a text file, this will work:
d = {}
with open('redundancy.txt', 'r') as infile:
    for item in infile:
        b = item.split()
        if len(b) < 4:  # skip blank or malformed lines
            continue
        key = b[1].rstrip(',')  # "GB-ENG," -> "GB-ENG"
        d[key] = d.get(key, 0) + int(b[3])  # int() instead of eval()
print(d)
It gives this result (on Python 3.7+, where dicts keep insertion order):
{'GB-ENG': 3521, 'RO-B': 9, 'DE-NW': 4, 'DE-BW': 3, 'DE-BY': 20, 'DE-HH': 34, 'BE-BRU': 27}