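How to extract URL data from a string

Asked: 2019-04-26 18:49:01

Tags: python string extract

I have the following string, which contains many URL values. How can I extract the URL that follows each DataUrl key in this string, so that I end up with a list of URLs? For example: americanexpress.com, vice.com, chegg.com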

{'DataUrl':'americanexpress.com','Country':{'Rank':'96','Reach':{'PerMillion':'7350'},'PageViews':{'PerMillion':'600.2','PerUser':'3.6'}},'Global':{'Rank':'362'}},{'DataUrl':'vice.com','Country':{'Rank':'97','Reach':{'PerMillion':'15703.61'},'PageViews':{'PerMillion':'489.97','PerUser':'1.38'}},'Global':{'Rank':'208'}},{'DataUrl':'chegg.com','Country':{'Rank':'98','Reach':{'PerMillion':'6280'},'PageViews':{'PerMillion':'882.3','PerUser':'6.2'}},'Global':{'Rank':'402'}},{'DataUrl':'mlb.com','Country':{'Rank':'99','Reach':{'PerMillion':'7280'},'PageViews':{'PerMillion':'564.1','PerUser':'3.42'}},'Global':{'Rank':'427'}},{'DataUrl':'xnxx.com','Country':{'Rank':'100','Reach':{'PerMillion':'5560'},'PageViews':{'PerMillion':'1271','PerUser':'10.1'}},'Global':{'Rank':'95'}

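I have tried various findall expressions.

3 answers:

Answer 0: (score: 1)

Python has a built-in package called json that can be used to work with JSON data.

You can parse the JSON text into Python objects and then read the DataUrl values from them easily.

See https://www.w3schools.com/python/python_json.asp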
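
As a minimal sketch of that approach, assuming the data is available as valid JSON (double-quoted strings, with the records wrapped in a list). The two-record `sample` below is a shortened, hypothetical stand-in for the real data:

```python
import json

# Hypothetical, shortened sample: valid JSON needs double quotes and,
# for several records, an enclosing list.
sample = (
    '[{"DataUrl": "americanexpress.com", "Global": {"Rank": "362"}}, '
    '{"DataUrl": "vice.com", "Global": {"Rank": "208"}}]'
)

records = json.loads(sample)  # parse the JSON text into Python lists/dicts
urls = [record["DataUrl"] for record in records]
print(urls)
```

Note that the string in the question uses single quotes, so `json.loads` would reject it as it stands.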

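Answer 1: (score: 1)

This looks like part of some JSON data, so if you have the complete JSON data you can load it with the json module and look up DataUrl in the resulting dictionaries.

If your JSON data is incomplete, you can use a regex instead: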

text = '''{'DataUrl': 'americanexpress.com', 'Country': {'Rank': '96', 'Reach': {'PerMillion': '7350'}, 'PageViews': {'PerMillion': '600.2', 'PerUser': '3.6'}}, 'Global': {'Rank': '362'}}, {'DataUrl': 'vice.com', 'Country': {'Rank': '97', 'Reach': {'PerMillion': '15703.61'}, 'PageViews': {'PerMillion': '489.97', 'PerUser': '1.38'}}, 'Global': {'Rank': '208'}}, {'DataUrl': 'chegg.com', 'Country': {'Rank': '98', 'Reach': {'PerMillion': '6280'}, 'PageViews': {'PerMillion': '882.3', 'PerUser': '6.2'}}, 'Global': {'Rank': '402'}}, {'DataUrl': 'mlb.com', 'Country': {'Rank': '99', 'Reach': {'PerMillion': '7280'}, 'PageViews': {'PerMillion': '564.1', 'PerUser': '3.42'}}, 'Global': {'Rank': '427'}}, {'DataUrl': 'xnxx.com', 'Country': {'Rank': '100', 'Reach': {'PerMillion': '5560'}, 'PageViews': {'PerMillion': '1271', 'PerUser': '10.1'}}, 'Global': {'Rank': '95'}'''

import re

urls = re.findall("'DataUrl': '([^']*)'", text)

print(urls)

Result

['americanexpress.com', 'vice.com', 'chegg.com', 'mlb.com', 'xnxx.com']
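
The pattern above expects exactly one space after the colon, whereas the string as shown in the question has none (`'DataUrl':'americanexpress.com'`). A slightly more tolerant sketch, assuming the only variation is optional whitespace around the colon; the shortened `sample` is hypothetical and mixes both spacings on purpose:

```python
import re

# Hypothetical, shortened sample: first record has no space after the colon,
# second record has one.
sample = (
    "{'DataUrl':'americanexpress.com','Country':{'Rank':'96'}},"
    "{'DataUrl': 'vice.com','Country':{'Rank':'97'}}"
)

# \s* permits zero or more whitespace characters on either side of the colon
pattern = re.compile(r"'DataUrl'\s*:\s*'([^']*)'")
urls = pattern.findall(sample)
print(urls)
```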

You can also try using .split("{'DataUrl': '") followed by .split("',")

text = '''{'DataUrl': 'americanexpress.com', 'Country': {'Rank': '96', 'Reach': {'PerMillion': '7350'}, 'PageViews': {'PerMillion': '600.2', 'PerUser': '3.6'}}, 'Global': {'Rank': '362'}}, {'DataUrl': 'vice.com', 'Country': {'Rank': '97', 'Reach': {'PerMillion': '15703.61'}, 'PageViews': {'PerMillion': '489.97', 'PerUser': '1.38'}}, 'Global': {'Rank': '208'}}, {'DataUrl': 'chegg.com', 'Country': {'Rank': '98', 'Reach': {'PerMillion': '6280'}, 'PageViews': {'PerMillion': '882.3', 'PerUser': '6.2'}}, 'Global': {'Rank': '402'}}, {'DataUrl': 'mlb.com', 'Country': {'Rank': '99', 'Reach': {'PerMillion': '7280'}, 'PageViews': {'PerMillion': '564.1', 'PerUser': '3.42'}}, 'Global': {'Rank': '427'}}, {'DataUrl': 'xnxx.com', 'Country': {'Rank': '100', 'Reach': {'PerMillion': '5560'}, 'PageViews': {'PerMillion': '1271', 'PerUser': '10.1'}}, 'Global': {'Rank': '95'}'''

urls = text.split("{'DataUrl': '")
urls = [item.split("',")[0] for item in urls if item]
print(urls)

Result

['americanexpress.com', 'vice.com', 'chegg.com', 'mlb.com', 'xnxx.com']

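If you have complete, well-formed JSON (using " instead of ') then you can use the json module.

Here I use the complete data, wrapped in [ ], and replace ' with " before loading it: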

text = '''[{'DataUrl': 'americanexpress.com', 'Country': {'Rank': '96', 'Reach': {'PerMillion': '7350'}, 'PageViews': {'PerMillion': '600.2', 'PerUser': '3.6'}}, 'Global': {'Rank': '362'}}, {'DataUrl': 'vice.com', 'Country': {'Rank': '97', 'Reach': {'PerMillion': '15703.61'}, 'PageViews': {'PerMillion': '489.97', 'PerUser': '1.38'}}, 'Global': {'Rank': '208'}}, {'DataUrl': 'chegg.com', 'Country': {'Rank': '98', 'Reach': {'PerMillion': '6280'}, 'PageViews': {'PerMillion': '882.3', 'PerUser': '6.2'}}, 'Global': {'Rank': '402'}}, {'DataUrl': 'mlb.com', 'Country': {'Rank': '99', 'Reach': {'PerMillion': '7280'}, 'PageViews': {'PerMillion': '564.1', 'PerUser': '3.42'}}, 'Global': {'Rank': '427'}}, {'DataUrl': 'xnxx.com', 'Country': {'Rank': '100', 'Reach': {'PerMillion': '5560'}, 'PageViews': {'PerMillion': '1271', 'PerUser': '10.1'}}, 'Global': {'Rank': '95'}}]'''
text = text.replace("'", '"')

import json

data = json.loads(text)
urls = [item['DataUrl'] for item in data]

print(urls)

Result

['americanexpress.com', 'vice.com', 'chegg.com', 'mlb.com', 'xnxx.com']
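
Replacing every ' with " stops working as soon as a value itself contains an apostrophe. Because the text is written as Python literals, one alternative sketch is `ast.literal_eval` from the standard library, which reads single-quoted dicts directly. The shortened `sample` and its `Note` field are hypothetical, added only to show a value with an apostrophe:

```python
import ast

# Hypothetical, shortened sample; the second record contains an apostrophe,
# which would produce invalid JSON after a blanket quote replacement.
sample = """[{'DataUrl': 'americanexpress.com', 'Global': {'Rank': '362'}},
 {'DataUrl': 'vice.com', 'Note': "it's an example"}]"""

# literal_eval only evaluates Python literals (dicts, lists, strings, numbers);
# it does not execute arbitrary code.
data = ast.literal_eval(sample)
urls = [item['DataUrl'] for item in data]
print(urls)
```

Like `json.loads`, this needs complete text with balanced brackets; for a truncated fragment the regex approach is still the fallback.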

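Answer 2: (score: -2)

This answer is for non-JSON data.

You can use a regular expression to detect URLs in any text. One way to do this with regular expressions in Python is described in:

geeksforgeeks answer

An example from that link is shown here: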

# Python code to find the URLs in an input string
# using a regular expression
import re

def Find(string):
    # findall() returns every non-overlapping match of the URL pattern
    url = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', string)
    return url

# Driver code
string = ('My Profile: https://auth.geeksforgeeks.org/user/Chinmoy%20Lenka/articles '
          'in the portal of http://www.geeksforgeeks.org/')
print("Urls: ", Find(string))