Parse and replace multiple URLs in a multi-line string

Time: 2019-06-20 20:17:48

Tags: python url multilinestring

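For my project, I am parsing multiple JIRA tickets whose description fields contain text in varying formats (that is, the order and number of links can change from ticket to ticket), and I want to remove every link that starts with a particular URL (for example: www.website.com/app/client =?................). I am using Python to do this.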

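The problem I am facing is that I am not sure how to capture the entire URL, because its length and format change every time. I tried the re library, but my pattern matched only a specific part of the URL (for example, up to "/monitor", but not including the rest of "/monitor-text.random12443").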

Should I use the urlparse library instead? If so, how can I identify the individual URLs without including the rest of the string?
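As a rough illustration of the urlparse idea, here is a minimal sketch that applies urlparse to each whitespace-separated token instead of to the whole description. The function names and the default host www.removethisurl.com are placeholders chosen for this example, not anything from the JIRA data. It assumes the links carry an explicit http:// or https:// scheme; without a scheme, urlparse puts the host into the path, so such links would not be detected this way.

```python
from urllib.parse import urlparse

# Hypothetical default host, used only for illustration.
TARGET_HOST = "www.removethisurl.com"


def is_target_url(token, host=TARGET_HOST):
    """Return True if token parses as an http(s) URL on the given host."""
    parsed = urlparse(token)
    return parsed.scheme in ("http", "https") and parsed.netloc.lower() == host.lower()


def find_target_urls(text, host=TARGET_HOST):
    """Collect every whitespace-delimited token that is a URL on host."""
    # str.split() with no arguments splits on spaces, tabs, and newlines,
    # so a multi-line description is handled in one pass.
    return [token for token in text.split() if is_target_url(token, host)]
```

Because the text is split on whitespace first, each URL is examined on its own and the surrounding words never reach urlparse.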

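Problem summary:

How can I parse a multi-line string, identify multiple links that begin with a given prefix (for example: www.removethisurl.com/), and remove or replace each of those links in full without changing the rest of the string?

Any and all help would be greatly appreciated!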
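One possible way to approach the summary above is a single regular expression: an optional scheme, the escaped prefix, then every following non-whitespace character. This is only a sketch under the assumption that a URL ends at the first whitespace character; the function name and parameters are made up for this example.

```python
import re


def replace_urls_with_prefix(text, prefix, replacement=""):
    """Replace every whitespace-delimited URL that starts with prefix.

    Assumption: a URL never contains whitespace, so \\S* consumes the
    rest of the link regardless of its length or format.
    """
    pattern = re.compile(r"(?:https?://)?" + re.escape(prefix) + r"\S*")
    # A function is passed to sub() so that backslashes in the
    # replacement text are inserted literally.
    return pattern.sub(lambda match: replacement, text)
```

Calling replace_urls_with_prefix(description, "www.removethisurl.com/", "") would delete the matching links and leave every other character, including line breaks, untouched. If the tickets wrap links in JIRA markup such as square brackets, the trailing \S* would also swallow the closing bracket, so the pattern would need adjusting for that case.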


For reference, here is a sample of the string in the description field:

h1. Title of this section
Random text 
more random text
.
.
.
h3. URL section
Random text
description of the URL | Url that does NOT need to be changed
description of the URL | Url that does NOT need to be changed

Random text:
description of the URL | Url that DOES need to be changed

Url that DOES need to be changed
.
.
.
h3. More text

more random URLS that DO NOT need to be changed

Sample code I tried:

import re
from urllib.parse import urlparse

# totalComp and json_data_comp come from the JIRA query (not shown)
# Parse through each issue in the list
for issue in range(totalComp):
    # A dict with the following format: {'customfield..': 'URL'}
    dataToChange = json_data_comp['issues'][issue]['fields']
    print("Issue {} has the following componentfield_11111 entry: {}".format(issue, dataToChange))
    testStr = dataToChange['customfield_15462']

    # urlparse treats the whole description string as a single URL
    x = urlparse(testStr)
    print(f'X fragment is: {x.fragment}')
    rest = re.search(r'[^\s]+', x.fragment)  # Matches the first run of non-whitespace characters
    print(f'X fragment is: {rest}')

Sample output:

Issue 23 has the following componentfield_11111 entry: {'customfield_15462': 'Production: https://www.URLtoChange.com/sv.do?id=KTy9VBJe2rgWug1BNFbCJYnyuE37TMvtPzQQbJiQpCAGtN9msOPrrcYb4DvsQiY%2FUR5WD%2FXshycb%0AxrwvH6CoPHoiIFDuVU4z      Preview Texts: https://URL.to.NOT.change.com/?filter=HCPms'}

X fragment is: 

X fragment is: None
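The empty fragment in this output is expected: urlparse treats its argument as one URL, and the fragment is only the part after a "#", which this string does not contain, so re.search then runs on an empty string and returns None. A hedged alternative is to locate each URL with a regular expression first and use urlparse only to inspect the host of each match. The sketch below assumes every link starts with http:// or https:// and ends at whitespace; the function name, the "[removed]" placeholder, and the shortened id value in the sample are invented for illustration.

```python
import re
from urllib.parse import urlparse

# Assumption: links start with http:// or https:// and contain no whitespace.
URL_RE = re.compile(r"https?://\S+")


def replace_urls_on_host(text, host, replacement="[removed]"):
    """Replace each URL whose host equals host; keep all other text as-is."""

    def _swap(match):
        url = match.group(0)
        # urlparse(...).hostname is lowercased, so compare case-insensitively.
        hostname = urlparse(url).hostname or ""
        return replacement if hostname == host.lower() else url

    return URL_RE.sub(_swap, text)


# Shortened version of the field value shown above (the id is abbreviated).
sample = (
    "Production: https://www.URLtoChange.com/sv.do?id=abc%2Fdef"
    "      Preview Texts: https://URL.to.NOT.change.com/?filter=HCPms"
)
print(replace_urls_on_host(sample, "www.URLtoChange.com"))
```

Only the Production link is swapped for the placeholder here; the Preview Texts link and the spacing between the two entries stay as they were.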

0 answers:

No answers