Question

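I have a list of strings containing thousands of URL values with different structures, and I am trying to use a regular expression to extract specific information from them. An example URL is given below to show the structure of this particular kind of URL (note that there are many other records in this format; only the numbers change across the data):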

url_id | url_text
15     | /course/123908/discussion_topics/394785/entries/980389/read

Using the re library in Python, I can find the URLs that have this structure:

re.findall(r"/course/\d{6}/discussion_topics/\d{6}/entries/\d{6}/read", text)

However, I also need to extract the '394785' and '980389' values and create a new matrix that might look like this:

url_id | topic_394785 | entry_980389 | {other items will be added as new column}
15     | 1            | 1            | 0       | 0     | 1    | it goes like this

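Can anybody help me extract this specific information? I know that the 'split' method of 'str' could be an option, but I wonder whether there is a better solution.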

Thanks!

Answer 1

Do you mean something like this?

import re

text = '/course/123908/discussion_topics/394785/entries/980389/read'
pattern = r"/course/\d{6}/discussion_topics/(?P<topic>\d{6})/entries/(?P<entry>\d{6})/read"

for match in re.finditer(pattern, text):
    topic, entry = match.group('topic'), match.group('entry')
    print('Topic ID={}, entry ID={}'.format(topic, entry))

Output

Topic ID=394785, entry ID=980389
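
The question also asks for a 0/1 matrix with one column per extracted value. Below is a minimal sketch of one way to get there with the same named groups, using only the standard library. The input shape (a list of (url_id, url_text) pairs), the helper name build_indicator_rows, and the second and third sample records are assumptions made for illustration; they are not from the question.

```python
import re

# Same pattern as above, compiled once so it can be reused for every record.
PATTERN = re.compile(
    r"/course/\d{6}/discussion_topics/(?P<topic>\d{6})/entries/(?P<entry>\d{6})/read"
)

def build_indicator_rows(records):
    """Hypothetical helper: records is an iterable of (url_id, url_text) pairs.

    Returns (columns, rows), where columns is a sorted list of column names
    and rows maps each url_id to a dict of column name -> 0 or 1.
    """
    found = {}  # url_id -> set of column names seen for that id
    for url_id, url_text in records:
        labels = found.setdefault(url_id, set())
        match = PATTERN.fullmatch(url_text)
        if match:  # URLs with a different structure contribute no columns
            labels.add("topic_" + match.group("topic"))
            labels.add("entry_" + match.group("entry"))
    columns = sorted(set().union(*found.values()))
    rows = {
        url_id: {col: int(col in labels) for col in columns}
        for url_id, labels in found.items()
    }
    return columns, rows

# Sample data: only the first record comes from the question; the rest are made up.
records = [
    (15, "/course/123908/discussion_topics/394785/entries/980389/read"),
    (16, "/course/123908/discussion_topics/394785/entries/111222/read"),
    (17, "/course/123908/assignments/555666"),
]
columns, rows = build_indicator_rows(records)
print("url_id", columns)
for url_id, row in rows.items():
    print(url_id, [row[col] for col in columns])
```

fullmatch is used here so that a URL only counts when the whole string has this structure; if the URLs are embedded in longer text, re.finditer as in the snippet above is the better fit. If pandas is available, the rows dict can probably be turned into a table with pandas.DataFrame.from_dict(rows, orient="index"), but that step is not shown or tested here.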
