How do I apply a regular expression to each sublist of a list?

Asked: 2015-05-19 05:08:31

Tags: python regex list python-2.7 parsing

Let's say I have a list of lists like this:

lis_ = [['"Fun is the enjoyment of pleasure"\t\t',
         '@username det fanns ett utvik med "sabrina without a stitch". acke nothing. @username\t\t','Report by @username  - #JeSuisCharlie Movement Leveraged to Distribute DarkComet Malware https://t.co/k9sOEpKjbg\t\t'],
        ['I just became the mayor of Porta Romana on @username! http://4sq.com/9QROVv\t\t', "RT benturner83 Someone's chucking stuff out of the window of an office on tottenham court road #tcr street evacuated http://t.co/heyOhpb1\t\t", "@username Don't use my family surname for your app ???? http://t.co/1yYLXIO9\t\t"]
        ]

I want to remove the links from each sublist, so I tried this regular expression:

new_list = re.sub(r'^https?:\/\/.*[\r\n]*', '', tweets, flags=re.MULTILINE)

I used the MULTILINE flag because when I print list_, it looks like this:

[]
[]
[]
...
[]

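The problem with this approach is that I obviously get TypeError: expected string or buffer, because I can't pass a list of sublists to the regex. How can I apply the regex above to each sublist in list_ to get something like this (i.e., sublists with all links removed):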

[['"Fun is the enjoyment of pleasure"\t\t',
         '@username det fanns ett utvik med "sabrina without a stitch". acke nothing. @username\t\t','Report by @username  - #JeSuisCharlie Movement Leveraged to Distribute DarkComet Malware'],
        ['I just became the mayor of Porta Romana on @username! \t\t', "RT benturner83 Someone's chucking stuff out of the window of an office on tottenham court road #tcr street evacuated \t\t", "@username Don't use my family surname for your app ????\t\t"]
        ]

Can this be done with map, or is there another efficient way?

Thanks in advance.

2 answers:

Answer 0: (score: 1)

You seem to have a list of lists of strings.

In that case, you just need to iterate over those lists in the right way:

import re

list_ = [['blablablalba', 'blabalbablbla', 'blablala', 'http://t.co/xSnsnlNyq5'], ['blababllba', 'blabalbla', 'blabalbal'],['http://t.co/xScsklNyq5'], ['blablabla', 'http://t.co/xScsnlNyq3']]

def remove_links(sublist):
    return [s for s in sublist if not re.search(r'https?:\/\/.*[\r\n]*', s)]

final_list = map(remove_links, list_)
# [['blablablalba', 'blabalbablbla', 'blablala'], ['blababllba', 'blabalbla', 'blabalbal'], [], ['blablabla']]

If you want to remove any empty sublists afterwards:

final_final_list = [l for l in final_list if l]
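
Note that this filter drops every string that contains a link, rather than deleting only the link text, and that on Python 3 map returns a lazy iterator instead of a list. A minimal sketch that behaves the same on Python 2.7 and Python 3, using shortened sample data (the names LINK_RE and without_empty are illustrative, not part of the answer):

```python
import re

# Shortened sample data, same shape as in the answer above.
list_ = [['blablablalba', 'http://t.co/xSnsnlNyq5'],
         ['blababllba', 'blabalbla'],
         ['http://t.co/xScsklNyq5']]

# Hypothetical name: a pre-compiled pattern, reused for every string.
LINK_RE = re.compile(r'https?://')

def remove_links(sublist):
    # Keep only the strings that contain no link at all.
    return [s for s in sublist if not LINK_RE.search(s)]

# A list comprehension yields a real list on both Python 2 and Python 3.
final_list = [remove_links(sub) for sub in list_]

# Drop the sublists that ended up empty.
without_empty = [sub for sub in final_list if sub]
```

Because both steps build real lists, final_list can be iterated more than once, which would not be true of a Python 3 map object.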

Answer 1: (score: 1)

You need to use \b instead of the start-of-line anchor.
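
A quick check of the difference on a single string may help. This is only a sketch; the sample string is shortened from the question, and the names anchored and boundary are made up for illustration:

```python
import re

s = 'mayor of Porta Romana on @username! http://4sq.com/9QROVv\t\t'

# '^' matches only at the start of the string (or of a line with re.MULTILINE),
# so a link in the middle of the text is left untouched.
anchored = re.sub(r'^https?://.*', '', s, flags=re.MULTILINE)

# '\b' matches at the word boundary just before 'http', wherever it occurs.
# The trailing '.*' then consumes the rest of the line, tabs included.
boundary = re.sub(r'\bhttps?://.*', '', s)
```

Here anchored is identical to s, while boundary ends right after the space that preceded the link.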

>>> import re
>>> lis_ = [['"Fun is the enjoyment of pleasure"\t\t',
         '@username det fanns ett utvik med "sabrina without a stitch". acke nothing. @username\t\t','Report by @username  - #JeSuisCharlie Movement Leveraged to Distribute DarkComet Malware https://t.co/k9sOEpKjbg\t\t'],
        ['I just became the mayor of Porta Romana on @username! http://4sq.com/9QROVv\t\t', "RT benturner83 Someone's chucking stuff out of the window of an office on tottenham court road #tcr street evacuated http://t.co/heyOhpb1\t\t", "@username Don't use my family surname for your app ???? http://t.co/1yYLXIO9\t\t"]
        ]
>>> [[re.sub(r'\bhttps?:\/\/.*[\r\n]*', '', i)] for x in lis_ for i in x]
[['"Fun is the enjoyment of pleasure"\t\t'], ['@username det fanns ett utvik med "sabrina without a stitch". acke nothing. @username\t\t'], ['Report by @username  - #JeSuisCharlie Movement Leveraged to Distribute DarkComet Malware '], ['I just became the mayor of Porta Romana on @username! '], ["RT benturner83 Someone's chucking stuff out of the window of an office on tottenham court road #tcr street evacuated "], ["@username Don't use my family surname for your app ???? "]]

Or, to keep the original sublist grouping:

>>> l = []
>>> for i in lis_:
        m = []
        for j in i:
            m.append(re.sub(r'\bhttps?:\/\/.*[\r\n]*', '', j))
        l.append(m)


>>> l
[['"Fun is the enjoyment of pleasure"\t\t', '@username det fanns ett utvik med "sabrina without a stitch". acke nothing. @username\t\t', 'Report by @username  - #JeSuisCharlie Movement Leveraged to Distribute DarkComet Malware '], ['I just became the mayor of Porta Romana on @username! ', "RT benturner83 Someone's chucking stuff out of the window of an office on tottenham court road #tcr street evacuated ", "@username Don't use my family surname for your app ???? "]]
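
The loop above can also be written as a nested list comprehension that keeps the sublist grouping. The sketch below additionally swaps .* for \S+, which is my assumption rather than part of the answer, so that only the URL itself is removed and the trailing tabs survive, closer to the output requested in the question. The sample strings are shortened, and the names url_re and cleaned are illustrative:

```python
import re

lis_ = [['"Fun is the enjoyment of pleasure"\t\t',
         'Report by @username https://t.co/k9sOEpKjbg\t\t'],
        ['I just became the mayor on @username! http://4sq.com/9QROVv\t\t']]

# \S+ stops at the first whitespace character, so text after the URL is kept.
url_re = re.compile(r'\bhttps?://\S+')

# Outer comprehension walks the sublists, inner one cleans each string.
cleaned = [[url_re.sub('', s) for s in sub] for sub in lis_]
```

With this pattern the space that preceded the URL remains, so a follow-up strip or a slightly wider pattern may be wanted depending on how the text is used later.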