It has been a while since I last worked with this, and I can't figure out how to resolve my problem.
I have multiple paragraphs like those in the Packages.gz file available at this link: http://fr.archive.ubuntu.com/ubuntu/dists/trusty-security/main/binary-amd64/
I would like help processing it with a regular expression so that I end up with a dictionary whose keys are the packages and whose values are lists of the packages they provide.
As you can see, some packages provide one or more packages and others don't. My best regular expression so far is the following:
((?<=Package: ).*)|((?<=Provides: )(?:[, ]*[a-zA-Z0-9-+.]*))
It stops at the first package on the "Provides:" line, but I need all of them, without the ", " separators.
Any help is appreciated. Thank you.
Answer 0 (score: 1)
You don't need to reinvent the wheel here. The python-apt library already implements the text file parsing you want. I recommend using it. It will give you the list of provides for a package.
Answer 1 (score: 0)
Here is a program that builds a dict object mapping "package" lines to lists representing "provides" lines. It uses a regular expression, and re.findall, as requested.
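import re
from pprint import pprint

with open('Packages') as fp:
    data = fp.read()

# Flags are passed as an argument rather than as an inline (?smx) group,
# which must sit at the very start of the pattern in Python 3.11+.
data = re.findall(
    r'''
    ^Package:\s*(.*?)$      # The package line
    .*?                     # Arbitrary lines
    (?:
      ^Provides:\s*(.*?)$   # The provides line
      |                     # OR
      ^$                    # an empty line
    )
    ''',
    data,
    flags=re.DOTALL | re.MULTILINE | re.VERBOSE)

# Split on commas and strip the space that follows each one
data = {k: [p.strip() for p in v.split(',')] if v else [] for k, v in data}
pprint(data)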
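To see what the pattern captures, here is a small self-contained check that runs it on an inline sample instead of the real file. The two paragraphs and their package names (alpha, beta, foo, bar) are made up for illustration, and parse_provides is a hypothetical helper name. Treat it as a sketch of the same idea: the flags are passed as arguments and each provided name is stripped.

```python
import re

# Hypothetical sample in the Packages layout; the names are illustrative only.
SAMPLE = (
    "Package: alpha\n"
    "Version: 1.0\n"
    "Provides: foo, bar\n"
    "Description: first\n"
    "\n"
    "Package: beta\n"
    "Version: 2.0\n"
    "Description: second\n"
    "\n"
)

PATTERN = re.compile(
    r"""
    ^Package:\s*(.*?)$        # the package name
    .*?                       # arbitrary lines, matched lazily
    (?:
        ^Provides:\s*(.*?)$   # the provides line
      |                       # or
        ^$                    # the empty line that ends the paragraph
    )
    """,
    re.DOTALL | re.MULTILINE | re.VERBOSE,
)


def parse_provides(text):
    """Map each package name to the list of names it provides."""
    return {
        name: [p.strip() for p in provides.split(",")] if provides else []
        for name, provides in PATTERN.findall(text)
    }


result = parse_provides(SAMPLE)
print(result)
```

A paragraph without a Provides line ends at its empty line, so the second group does not take part in the match and re.findall reports it as an empty string. That is why beta maps to an empty list.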
Alternatively, here is a solution that does not use a regular expression. It runs slightly faster on my PC with your 70,000-line file. The speed difference is largely irrelevant, however: it is less than 0.02 seconds.
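from pprint import pprint

def gen():
    package = None  # name of the paragraph currently being read
    with open('Packages') as fp:
        for line in fp:
            if line.startswith('Package:'):
                package = line.split(':', 1)[1].strip()
            elif package and line.startswith('Provides:'):
                # Split on the first colon only, in case the value contains one
                provides = line.split(':', 1)[1].split(',')
                yield package, [p.strip() for p in provides]
                package = None
            elif package and line == '\n':
                # End of a paragraph that had no Provides line
                yield package, []
                package = None
    if package:
        # Last paragraph when the file does not end with an empty line
        yield package, []

data = dict(gen())
pprint(data)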
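The question refers to Packages.gz, while both programs above open an already decompressed file named Packages. Below is a minimal sketch of the line-by-line approach reading the compressed file directly through the standard gzip module. It assumes the archive was downloaded locally as Packages.gz and is UTF-8 encoded; iter_provides and load_provides are hypothetical names chosen for this example.

```python
import gzip


def iter_provides(lines):
    """Yield (package, provides_list) pairs from an iterable of text lines."""
    package = None
    for line in lines:
        if line.startswith("Package:"):
            package = line.split(":", 1)[1].strip()
        elif package and line.startswith("Provides:"):
            items = line.split(":", 1)[1].split(",")
            yield package, [item.strip() for item in items]
            package = None
        elif package and not line.strip():
            # Blank line: the paragraph ended without a Provides line.
            yield package, []
            package = None
    if package:
        # Last paragraph when the input does not end with a blank line.
        yield package, []


def load_provides(path="Packages.gz"):
    """Build the dictionary from a gzip-compressed Packages file."""
    # Mode "rt" decompresses and decodes to text on the fly.
    with gzip.open(path, "rt", encoding="utf-8") as fp:
        return dict(iter_provides(fp))
```

Calling load_provides() with no argument reads Packages.gz from the current directory. Keeping the parsing in iter_provides means the same logic can be reused on any iterable of lines, including a plain open('Packages') handle.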