
Extracting data from strings in a given format

Time: 2018-05-06 22:03:37

Tags: python regex split

I receive strings in this format:

GETMOVIE#genre:Action&year:1990-2007&country:USA
GETMOVIE#genre:Animation&year:2000-2010&country:Russia
GETMOVIE#genre:X&year:Y&country:Z

I would like to know how to extract X, Y, and Z from these strings into separate strings or a list. I tried slicing, but it seems impossible. Any hints?

3 answers:

Answer 0: (score: 2)

Why would splitting be impossible?

Here is a nice one-liner:

s = "GETMOVIE#genre:Animation&year:2000-2010&country:Russia"
d = dict(p.split(':', 1) for p in s.partition("#")[2].split("&"))
print(d)
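
To get the three values as a list for every sample string from the question, the same one-liner can be wrapped in a small function. This is only a sketch: `parse_fields` is a hypothetical name, and it assumes every string contains a `#` followed by `key:value` pairs joined by `&`.

```python
def parse_fields(s):
    """Return the key:value pairs that follow '#' as a dict."""
    # partition("#")[2] is everything after the first '#'.
    payload = s.partition("#")[2]
    # split(":", 1) splits on the first colon only, so a value may contain ':'.
    return dict(p.split(":", 1) for p in payload.split("&"))


samples = [
    "GETMOVIE#genre:Action&year:1990-2007&country:USA",
    "GETMOVIE#genre:Animation&year:2000-2010&country:Russia",
    "GETMOVIE#genre:X&year:Y&country:Z",
]

for s in samples:
    d = parse_fields(s)
    # Pick the values in a fixed order rather than relying on dict order.
    print([d["genre"], d["year"], d["country"]])
```

A string without `#`, or a pair without `:`, makes `dict()` raise `ValueError`, so malformed input fails loudly rather than silently.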

Answer 1: (score: 1)

import re

line = 'GETMOVIE#genre:Action&year:1990-2007&country:USA'
pattern = r'^GETMOVIE#genre:(.+)&year:(.+)&country:(.+)$'
genre, year, country = re.match(pattern, line).groups()
print(genre, year, country)  # Action 1990-2007 USA
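
One caveat with this approach: `re.match` returns `None` when a line does not fit the pattern, so calling `.groups()` on the result raises `AttributeError`. A more defensive variant might look like the sketch below. `parse_line` and `PATTERN` are hypothetical names, and `[^&]+` replaces `.+` on the assumption that values never contain `&`.

```python
import re

# Named groups; [^&]+ stops each value at the next '&' instead of
# relying on backtracking from a greedy .+
PATTERN = re.compile(
    r"^GETMOVIE#genre:(?P<genre>[^&]+)"
    r"&year:(?P<year>[^&]+)"
    r"&country:(?P<country>[^&]+)$"
)


def parse_line(line):
    """Return a dict of the three fields, or None if the line does not match."""
    m = PATTERN.match(line)
    return m.groupdict() if m else None


print(parse_line("GETMOVIE#genre:Action&year:1990-2007&country:USA"))
print(parse_line("not a GETMOVIE line"))  # None
```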

Answer 2: (score: 1)

You can use str.split(), for example:
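
The code in this answer calls `process_data`, but its definition is not shown. The following is a minimal sketch of what such a function might look like, not the answerer's original. It assumes each line has the form `COMMAND#key:value&key:value` and groups the parsed fields by command name.

```python
def process_data(lines):
    """Group the parsed key:value fields of each line by command name."""
    result = {}
    for line in lines:
        # "GETMOVIE#genre:X&..." -> command "GETMOVIE", payload "genre:X&..."
        command, _, payload = line.partition("#")
        # Split on the first ':' only, so a value may itself contain ':'.
        fields = dict(pair.split(":", 1) for pair in payload.split("&"))
        result.setdefault(command, []).append(fields)
    return result
```

With the three sample lines as input, this returns a dict with the single key 'GETMOVIE' mapped to a list of three field dicts.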

Test code:

data = [x.strip() for x in """
    GETMOVIE#genre:Action&year:1990-2007&country:USA
    GETMOVIE#genre:Animation&year:2000-2010&country:Russia
    GETMOVIE#genre:X&year:Y&country:Z
""".split('\n')[1:-1]]

print(data)
print(process_data(data))

Results:

['GETMOVIE#genre:Action&year:1990-2007&country:USA', 
 'GETMOVIE#genre:Animation&year:2000-2010&country:Russia', 
 'GETMOVIE#genre:X&year:Y&country:Z']

{'GETMOVIE': [
    {'genre': 'Action', 'year': '1990-2007', 'country': 'USA'}, 
    {'genre': 'Animation', 'year': '2000-2010', 'country': 'Russia'}, 
    {'genre': 'X', 'year': 'Y', 'country': 'Z'}
]}
