Question

My data is in the following format:

Input data:

#@ <id_1d3s2ia_p3m_zkjp59>
<Eckhard_Christian>     <hasGender>     <ma<>le> .
#@ <id_1jmz109_1gi_t71dyx>
<Peter_Pinn<>e>   <created>       <In_Your_Arms_(Love_song_from_"Neighbours")> .
#@ <id_v9bcjt_ice_fraki6>
<Blanchester,_Ohio>     <hasWebsite>    <http://www.blanchester.com/> .
#@ <id_10tunwc_p3m_zkjp59>
<Hub_(bassi~st)> <hasGender>     <ma??le> .

Output data:

<Eckhard_Christian>     <hasGender>     <male> <id_1d3s2ia_p3m_zkjp59>.
<Peter_Pinne>   <created>       <In_Your_Arms_(Love_song_from_"Neighbours")> <id_1jmz109_1gi_t71dyx>.
<Blanchester,_Ohio>     <hasWebsite>    <http://www.blanchester.com/> <id_v9bcjt_ice_fraki6>.
<Hub_(bassist)> <hasGender>     <male> <id_10tunwc_p3m_zkjp59>.

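In the output data I want to remove, between each opening < and closing >, every character except alphanumerics and :, /, ., _, ( and ). I know Python lets me split with string.split(), but if I split using < and > as delimiters, then for <ma<>le> I get (<ma, <>, le>).

Is there some other way to split in Python so that I get the data in the desired form? Also, I want the < > token from the preceding line (the one after #@) to appear as the last column.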
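
A minimal sketch of the problem described above, assuming Python 3. Since str.split() takes only a single separator, re.split with a character class stands in here for "splitting on < and >"; the sample line is copied from the input data.

```python
import re

line = '<Eckhard_Christian>     <hasGender>     <ma<>le> .'

# Splitting on every '<' or '>' also cuts inside the damaged token,
# so the third field falls apart into 'ma' and 'le'.
pieces = [p for p in re.split(r'[<>]', line) if p.strip(' .')]
print(pieces)

# Splitting on whitespace keeps each <...> token in one piece,
# because the stray brackets never have whitespace next to them.
tokens = line.split()
print(tokens)
```

The second split only works as long as no field contains whitespace, which holds for the sample data shown here.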

Answer 1

Assuming the "real" < and > delimiters always have whitespace before or after them, you could try regular expressions:

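import re

with open('data') as data:
    for line in data:
        if line.startswith('#@'):
            # The id is the <...> token that follows '#@'.
            id_ = re.search(r'\s(<.*>)', line).group(1)
            # The fields are on the next line; each one ends with '>' plus whitespace.
            fields = re.findall(r'(<.*?>)\s', next(data))
            # Keep only word characters and : / . _ ( ) " inside each field.
            fields = ['<' + re.sub(r'[^\w:/._()"]', '', f) + '>' for f in fields]
            print(fields + [id_])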

Output:

['<Eckhard_Christian>', '<hasGender>', '<male>', '<id_1d3s2ia_p3m_zkjp59>']
['<Peter_Pinne>', '<created>', '<In_Your_Arms_(Love_song_from_"Neighbours")>', '<id_1jmz109_1gi_t71dyx>']
['<Blanchester_Ohio>', '<hasWebsite>', '<http://www.blanchester.com/>', '<id_v9bcjt_ice_fraki6>']
['<Hub_(bassist)>', '<hasGender>', '<male>', '<id_10tunwc_p3m_zkjp59>']
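
The lists above drop the comma in <Blanchester,_Ohio> and are not yet in the one-line-per-record layout asked for in the question. Below is a minimal sketch of one way to get closer, assuming Python 3. It assumes commas should be kept, because the expected output keeps them even though the question's list of allowed characters does not mention them. It also assumes the fields are joined with tabs. The names clean_field and convert are hypothetical helpers, not part of the original answer.

```python
import re

# Characters kept inside a field. The comma is an assumption based on the
# expected output in the question; the original pattern did not keep it.
UNWANTED = re.compile(r'[^\w:/._()",]')


def clean_field(field):
    """Strip unwanted characters and re-wrap the field in angle brackets."""
    return '<' + UNWANTED.sub('', field) + '>'


def convert(lines):
    """Yield one output line per '#@ <id>' line and the data line after it."""
    lines = iter(lines)
    for line in lines:
        if not line.startswith('#@'):
            continue
        data_line = next(lines, None)
        if data_line is None:
            break  # an id line with no data line after it
        id_ = re.search(r'\s(<.*>)', line).group(1)
        fields = re.findall(r'(<.*?>)\s', data_line)
        yield '\t'.join(clean_field(f) for f in fields) + ' ' + id_ + '.'


sample = [
    '#@ <id_1d3s2ia_p3m_zkjp59>\n',
    '<Eckhard_Christian>\t<hasGender>\t<ma<>le> .\n',
    '#@ <id_v9bcjt_ice_fraki6>\n',
    '<Blanchester,_Ohio>\t<hasWebsite>\t<http://www.blanchester.com/> .\n',
]
results = list(convert(sample))
for out in results:
    print(out)
```

To process a file instead of the sample list, an open file object can be passed straight to convert, since it only needs an iterable of lines.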
