I have a list of URLs and headlines from newspaper websites in my country. As a general example:
x = ['URL1','news1','news2','news3','URL2','news1','news2','URL3','news1']
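Each URL element is followed by a sequence of corresponding 'news' elements, and these sequences can differ in length. In the example above, URL1 has 3 corresponding news items and URL3 has only one.
Sometimes a URL has no corresponding 'news' elements: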
y = ['URL4','news1','news2','URL5','URL6','news1']
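I can easily find the index of each URL and the 'news' elements that belong to each URL.
My question is: is it possible to convert this list into a dictionary with the URL elements as keys and the 'news' elements as list/tuple values?
Expected output: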
z = {'URL1': ('news1', 'news2', 'news3'),
     'URL2': ('news1', 'news2'),
     'URL3': ('news1',),
     'URL4': ('news1', 'news2'),
     'URL5': (),
     'URL6': ('news1',)}
I saw a similar question in this post, but it does not solve my problem.
Answer 0 (score: 11)
You can do it like this:
>>> y = ['URL4','news1','news2','URL5','URL6','news1']
>>> result = {}
>>> current_url = None
>>> for entry in y:
...     if entry.startswith('URL'):
...         current_url = entry
...         result[current_url] = ()
...     else:
...         result[current_url] += (entry, )
...
>>> result
{'URL4': ('news1', 'news2'), 'URL5': (), 'URL6': ('news1',)}
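The same loop can be wrapped in a reusable function. The following is only a sketch and not part of the answer above: the names `group_by_url` and `is_url` are hypothetical, and it assumes that entries appearing before the first URL should simply be skipped and that a repeated URL should keep collecting into the same key.

```python
def group_by_url(entries, is_url=lambda e: e.startswith('URL')):
    """Map each URL entry to a tuple of the entries that follow it."""
    result = {}
    current_url = None
    for entry in entries:
        if is_url(entry):
            current_url = entry
            # setdefault keeps earlier news if the same URL shows up again
            result.setdefault(current_url, [])
        elif current_url is not None:
            result[current_url].append(entry)
        # entries before the first URL are skipped (an assumption)
    # Convert the collected lists to tuples, matching the expected output
    return {url: tuple(news) for url, news in result.items()}


y = ['URL4', 'news1', 'news2', 'URL5', 'URL6', 'news1']
print(group_by_url(y))
```

Collecting into lists and converting once at the end avoids building a new tuple on every `+=`, which may matter for long inputs.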
Answer 1 (score: 3)
You can use itertools.groupby with a key function to identify the URLs:
from itertools import groupby
def _key(url):
    return url.startswith("URL")  # in the body of _key, write code to identify a URL
data = ['URL1','news1','news2','news3','URL2','news1','news2','URL3','news1', 'URL4','news1','news2','URL5','URL6','news1']
new_d = [list(b) for _, b in groupby(data, key=_key)]
grouped = [[new_d[i], tuple(new_d[i+1])] for i in range(0, len(new_d), 2)]
result = dict([i for [*c, a], b in grouped for i in [(i, ()) for i in c]+[(a, b)]])
Output:
{
'URL1': ('news1', 'news2', 'news3'),
'URL2': ('news1', 'news2'),
'URL3': ('news1',),
'URL4': ('news1', 'news2'),
'URL5': (),
'URL6': ('news1',)
}
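Answer 2 (score: 3)
You can simply use the indices of the URL keys in the list, take whatever lies between two consecutive indices, and assign it to the first URL of the pair.
Like this: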
x = ['URL1','news1','news2','news3','URL2','news1','news2','URL3','news1']
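# Positions of the URL entries; enumerate avoids the first-match pitfall of x.index() when entries repeat
urls = [i for i, y in enumerate(x) if 'URL' in y]
adict = {}
for i in range(len(urls)):
    if i == len(urls) - 1:
        # Last URL: take everything up to the end of the list
        adict[x[urls[i]]] = x[urls[i]+1:]
    else:
        adict[x[urls[i]]] = x[urls[i]+1:urls[i+1]]
print(adict)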
Output:
{'URL1': ['news1', 'news2', 'news3'], 'URL2': ['news1', 'news2'], 'URL3': ['news1']}
Answer 3 (score: 2)
The more-itertools library includes a function, split_before(), that is very handy for this purpose:
import more_itertools as mt
{s[0]: tuple(s[1:]) for s in mt.split_before(x, lambda e: e.startswith('URL'))}
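I think this is cleaner than any of the other approaches in the answers posted before this one, but it does introduce an external dependency (unless you reimplement the function), so it will not suit every situation.
If your actual use case involves real URLs or something else rather than strings of the form URL#, just replace lambda e: e.startswith('URL') with any function that can tell the key elements apart from the value elements.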
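To illustrate both points (reimplementing the function and swapping the predicate), here is a rough sketch. The local `split_before` below is a simplified stand-in written for this example, not the library's actual source, and `looks_like_url` along with the example.com addresses are hypothetical. It assumes real keys start with an http or https scheme and that the list begins with a URL.

```python
def split_before(iterable, pred):
    """Yield lists of items, starting a new list before each item where pred is true."""
    chunk = []
    for item in iterable:
        if pred(item) and chunk:
            yield chunk
            chunk = []
        chunk.append(item)
    if chunk:
        yield chunk


def looks_like_url(entry):
    # Assumption: real keys start with an http(s) scheme
    return entry.startswith(('http://', 'https://'))


data = ['https://example.com/a', 'news1', 'news2',
        'https://example.com/b',
        'https://example.com/c', 'news1']
result = {s[0]: tuple(s[1:]) for s in split_before(data, looks_like_url)}
print(result)
```

If the list could start with a non-URL entry, the first chunk would use that entry as its key, so such input would need to be filtered first.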
Answer 4 (score: 1)
Another solution using groupby (a one-liner):
x = ['URL1','news1','news2','news3','URL2','news1','news2','URL3','news1', 'URL4','news1','news2','URL5','URL6','news1']
from itertools import groupby
out = {k: tuple(v) for _, (k, *v) in groupby(x, lambda k, d={'g':0}: (d.update(g=d['g']+1), d['g']) if k.startswith('URL') else (None, d['g']))}
from pprint import pprint
pprint(out)
Prints:
{'URL1': ('news1', 'news2', 'news3'),
'URL2': ('news1', 'news2'),
'URL3': ('news1',),
'URL4': ('news1', 'news2'),
'URL5': (),
'URL6': ('news1',)}
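The one-liner is dense because the key function hides a counter in a mutable default argument. As a rough, unrolled sketch of the same idea (not the answer's own code), the counter can be made explicit with itertools.accumulate: each entry is labelled with the number of URLs seen so far, and groupby then groups on that label. The names `group_ids` and `pairs` are chosen only for this illustration, and the list is assumed to start with a URL.

```python
from itertools import accumulate, groupby

x = ['URL1', 'news1', 'news2', 'news3', 'URL2', 'news1', 'news2',
     'URL3', 'news1', 'URL4', 'news1', 'news2', 'URL5', 'URL6', 'news1']

# Running count of URLs seen so far: every entry gets the number of its group
group_ids = accumulate(int(e.startswith('URL')) for e in x)

out = {}
for _, pairs in groupby(zip(group_ids, x), key=lambda p: p[0]):
    # The first entry of each group is the URL, the rest are its news items
    k, *v = (entry for _, entry in pairs)
    out[k] = tuple(v)

print(out)
```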