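I have the following string, which contains many URL values. How can I extract the URL that follows each DataUrl key in this string, so that I end up with a list of URLs? For example: americanexpress.com, vice.com, chegg.com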
{'DataUrl': 'americanexpress.com', 'Country': {'Rank': '96', 'Reach': {'PerMillion': '7350'}, 'PageViews': {'PerMillion': '600.2', 'PerUser': '3.6'}}, 'Global': {'Rank': '362'}}, {'DataUrl': 'vice.com', 'Country': {'Rank': '97', 'Reach': {'PerMillion': '15703.61'}, 'PageViews': {'PerMillion': '489.97', 'PerUser': '1.38'}}, 'Global': {'Rank': '208'}}, {'DataUrl': 'chegg.com', 'Country': {'Rank': '98', 'Reach': {'PerMillion': '6280'}, 'PageViews': {'PerMillion': '882.3', 'PerUser': '6.2'}}, 'Global': {'Rank': '402'}}, {'DataUrl': 'mlb.com', 'Country': {'Rank': '99', 'Reach': {'PerMillion': '7280'}, 'PageViews': {'PerMillion': '564.1', 'PerUser': '3.42'}}, 'Global': {'Rank': '427'}}, {'DataUrl': 'xnxx.com', 'Country': {'Rank': '100', 'Reach': {'PerMillion': '5560'}, 'PageViews': {'PerMillion': '1271', 'PerUser': '10.1'}}, 'Global': {'Rank': '95'}
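I have tried various findall expressions.
Answer 0: (score: 1)
Python has a built-in package called json that can be used to work with JSON data.
You can parse the JSON string into a Python object (a list of dictionaries) and then easily read each DataUrl.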
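A minimal sketch of that idea, assuming the data is available as complete, valid JSON (a closed list of objects with double-quoted strings). The shortened `sample` string is made up for illustration and is not the exact data from the question:

```python
import json

# Hypothetical, shortened sample: valid JSON needs double quotes and a closed list.
sample = (
    '[{"DataUrl": "americanexpress.com", "Global": {"Rank": "362"}}, '
    '{"DataUrl": "vice.com", "Global": {"Rank": "208"}}]'
)

data = json.loads(sample)                   # parse the JSON text into lists/dicts
urls = [item["DataUrl"] for item in data]   # pick the DataUrl value of each object
print(urls)
```

The string in the question uses single quotes and is cut off at the end, so `json.loads` would reject it as posted; it would need to be repaired first.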
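Answer 1: (score: 1)
It looks like part of some JSON data, so if you have the complete JSON data you can load it with the json module and look up DataUrl in each dictionary.
If your JSON data is incomplete, you can use a regex: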
text = '''{'DataUrl': 'americanexpress.com', 'Country': {'Rank': '96', 'Reach': {'PerMillion': '7350'}, 'PageViews': {'PerMillion': '600.2', 'PerUser': '3.6'}}, 'Global': {'Rank': '362'}}, {'DataUrl': 'vice.com', 'Country': {'Rank': '97', 'Reach': {'PerMillion': '15703.61'}, 'PageViews': {'PerMillion': '489.97', 'PerUser': '1.38'}}, 'Global': {'Rank': '208'}}, {'DataUrl': 'chegg.com', 'Country': {'Rank': '98', 'Reach': {'PerMillion': '6280'}, 'PageViews': {'PerMillion': '882.3', 'PerUser': '6.2'}}, 'Global': {'Rank': '402'}}, {'DataUrl': 'mlb.com', 'Country': {'Rank': '99', 'Reach': {'PerMillion': '7280'}, 'PageViews': {'PerMillion': '564.1', 'PerUser': '3.42'}}, 'Global': {'Rank': '427'}}, {'DataUrl': 'xnxx.com', 'Country': {'Rank': '100', 'Reach': {'PerMillion': '5560'}, 'PageViews': {'PerMillion': '1271', 'PerUser': '10.1'}}, 'Global': {'Rank': '95'}'''
import re
urls = re.findall("'DataUrl': '([^']*)'", text)
print(urls)
Result:
['americanexpress.com', 'vice.com', 'chegg.com', 'mlb.com', 'xnxx.com']
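The pattern above expects exactly one space after the colon and single quotes around the key and the value. If the spacing or quoting in the real input varies, a more tolerant pattern may help. The following is only a sketch; the `fragment` string is a made-up sample that mixes styles on purpose:

```python
import re

# Hypothetical fragment mixing spacing and quote styles; it is deliberately incomplete.
fragment = """{'DataUrl':'americanexpress.com','Country':{'Rank':'96'}},{"DataUrl" : "vice.com"},{'DataUrl': 'chegg.com'"""

# ['"] accepts either quote character; \s* allows optional spaces around the colon.
pattern = re.compile(r"""['"]DataUrl['"]\s*:\s*['"]([^'"]*)['"]""")

urls = pattern.findall(fragment)
print(urls)
```

Because the regex never tries to parse the surrounding structure, it still works when the string is truncated, as long as each DataUrl value itself is intact.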
You can also try .split("{'DataUrl': '") followed by .split("',"):
text = '''{'DataUrl': 'americanexpress.com', 'Country': {'Rank': '96', 'Reach': {'PerMillion': '7350'}, 'PageViews': {'PerMillion': '600.2', 'PerUser': '3.6'}}, 'Global': {'Rank': '362'}}, {'DataUrl': 'vice.com', 'Country': {'Rank': '97', 'Reach': {'PerMillion': '15703.61'}, 'PageViews': {'PerMillion': '489.97', 'PerUser': '1.38'}}, 'Global': {'Rank': '208'}}, {'DataUrl': 'chegg.com', 'Country': {'Rank': '98', 'Reach': {'PerMillion': '6280'}, 'PageViews': {'PerMillion': '882.3', 'PerUser': '6.2'}}, 'Global': {'Rank': '402'}}, {'DataUrl': 'mlb.com', 'Country': {'Rank': '99', 'Reach': {'PerMillion': '7280'}, 'PageViews': {'PerMillion': '564.1', 'PerUser': '3.42'}}, 'Global': {'Rank': '427'}}, {'DataUrl': 'xnxx.com', 'Country': {'Rank': '100', 'Reach': {'PerMillion': '5560'}, 'PageViews': {'PerMillion': '1271', 'PerUser': '10.1'}}, 'Global': {'Rank': '95'}'''
urls = text.split("{'DataUrl': '")
urls = [item.split("',")[0] for item in urls if item]
print(urls)
Result:
['americanexpress.com', 'vice.com', 'chegg.com', 'mlb.com', 'xnxx.com']
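If you have complete, well-formed JSON, with " instead of ', then you can use the json module.
Here I use the complete data (wrapped in [ ] and with the final } restored) and replace every ' with " before parsing: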
text = '''[{'DataUrl': 'americanexpress.com', 'Country': {'Rank': '96', 'Reach': {'PerMillion': '7350'}, 'PageViews': {'PerMillion': '600.2', 'PerUser': '3.6'}}, 'Global': {'Rank': '362'}}, {'DataUrl': 'vice.com', 'Country': {'Rank': '97', 'Reach': {'PerMillion': '15703.61'}, 'PageViews': {'PerMillion': '489.97', 'PerUser': '1.38'}}, 'Global': {'Rank': '208'}}, {'DataUrl': 'chegg.com', 'Country': {'Rank': '98', 'Reach': {'PerMillion': '6280'}, 'PageViews': {'PerMillion': '882.3', 'PerUser': '6.2'}}, 'Global': {'Rank': '402'}}, {'DataUrl': 'mlb.com', 'Country': {'Rank': '99', 'Reach': {'PerMillion': '7280'}, 'PageViews': {'PerMillion': '564.1', 'PerUser': '3.42'}}, 'Global': {'Rank': '427'}}, {'DataUrl': 'xnxx.com', 'Country': {'Rank': '100', 'Reach': {'PerMillion': '5560'}, 'PageViews': {'PerMillion': '1271', 'PerUser': '10.1'}}, 'Global': {'Rank': '95'}}]'''
text = text.replace("'", '"')
import json
data = json.loads(text)
urls = [item['DataUrl'] for item in data]
print(urls)
Result:
['americanexpress.com', 'vice.com', 'chegg.com', 'mlb.com', 'xnxx.com']
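Replacing every ' with " would corrupt the data if any value contained an apostrophe. Since the single-quoted text is also a valid Python literal, one alternative (not part of the answer above) is `ast.literal_eval` from the standard library. A minimal sketch, assuming the text is complete; the shortened `text` below is made up for illustration:

```python
import ast

# Hypothetical, shortened sample in the same single-quoted style as the question.
text = (
    "{'DataUrl': 'americanexpress.com', 'Global': {'Rank': '362'}}, "
    "{'DataUrl': 'vice.com', 'Global': {'Rank': '208'}}"
)

# Wrapping in [ ] turns the comma-separated dicts into one list literal.
data = ast.literal_eval("[" + text + "]")
urls = [item["DataUrl"] for item in data]
print(urls)
```

Like `json.loads`, this only works on complete input; a truncated string raises `SyntaxError`.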
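Answer 2: (score: -2)
This answer is for non-JSON data.
You can use a regular expression to detect URLs in any text. This is how regular expressions are used in Python.
Here is the example from that link: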
# Python code to find the URLs in an input string
# using a regular expression
import re

def Find(string):
    # findall() returns every non-overlapping match of the pattern;
    # the pattern only matches URLs that start with http:// or https://
    url = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', string)
    return url

# Driver code
string = 'My Profile: https://auth.geeksforgeeks.org/user/Chinmoy%20Lenka/articles in the portal of http://www.geeksforgeeks.org/'
print("Urls: ", Find(string))
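One caveat worth checking before applying this pattern to the data from the question: it requires an http:// or https:// prefix, while the DataUrl values there are bare domains. A quick sketch with shortened, made-up samples (the names `sample` and `with_scheme` are only for illustration):

```python
import re

# Same URL pattern as in the answer above, compiled once for reuse.
url_pattern = re.compile(
    r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
)

# Hypothetical samples: bare domains versus a URL with a scheme.
sample = "{'DataUrl': 'americanexpress.com'}, {'DataUrl': 'vice.com'}"
with_scheme = "see https://www.vice.com/en for details"

bare_matches = url_pattern.findall(sample)         # no scheme, so nothing matches
scheme_matches = url_pattern.findall(with_scheme)  # the scheme-prefixed URL is found
print(bare_matches)
print(scheme_matches)
```

For input like the question's, matching on the DataUrl key is therefore likely the more suitable approach.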