Python: grabbing multiple strings with different conditions

Date: 2014-09-19 09:44:58

Tags: python regex

My text is as follows:

<COREF ID="1">Salman</COREF> <COREF ID="2">Khan</COREF> (pronunciation born <COREF ID="3" REF="2">Abdul Rashid Salim Salman Khan</COREF> on 27 December 1965)[3] is an <COREF ID="14">Indian</COREF> film <COREF ID="15">actor</COREF>, <COREF ID="17">producer</COREF>, television <COREF ID="19">presenter</COREF>, and <COREF ID="20">philanthropist</COREF> known for <COREF ID="4" REF="2">his</COREF> Hindi films. <COREF ID="5" REF="2">He</COREF> is the <COREF ID="21">son</COREF> of <COREF ID="16" REF="15">actor</COREF> and screenwriter Salim <COREF ID="6" REF="2">Khan</COREF>. <COREF ID="7" REF="2">Khan</COREF> began <COREF ID="8" REF="2">his</COREF> acting career with <COREF ID="22">Biwi Ho</COREF> To <COREF ID="24">Aisi</COREF> but <COREF ID="18" REF="17">it</COREF> was <COREF ID="9" REF="2">his</COREF> second film <COREF ID="25">Maine Pyar</COREF> <COREF ID="26">Kiya</COREF>(1989), in which <COREF ID="10" REF="2">he</COREF> acted in a lead role, that garnered <COREF ID="11" REF="2">him</COREF> the Filmfare Award for Best Male Debut. <COREF ID="12" REF="2">Khan</COREF> has starred in several commercially successful films, such as <COREF ID="28">Saajan</COREF> (1991), <COREF ID="29">Hum Aapke Hain Koun</COREF>..! (1994), <COREF ID="30">Karan Arjun</COREF> (1995),<COREF ID="31">Judwaa</COREF> (1997), <COREF ID="32">Pyar</COREF> <COREF ID="27" REF="26">Kiya</COREF> To Darna <COREF ID="33">Kya</COREF> (1998), <COREF ID="23" REF="22">Biwi</COREF> No.1 (1999), and Hum Saath <COREF ID="34">Saath Hain</COREF> (1999), having appeared in the highest grossing film nine separate years during <COREF ID="13" REF="2">his</COREF> career, a record that remains unbroken.[4]

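What I want to do is:

  1. Get the string for each ID.
  2. Get only the IDs that have a REF. The result should give both the ID string and the REF string. If we have the ID and REF numbers, we can use a map data structure to look up the strings collected in step 1.

I tried it this way: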

    def doit(text):
        import re
        matches = re.findall(r'\>(.+?)\<', text)
        # matches is now ['String 1', 'String 2', 'String3']
        return ",".join(matches)
    print doit(string)
    

This produces all the strings individually.

Now, to scrape each ID, I did it this way:

    def doit(text):      
        import re
        #matches = re.findall((?<="ID=")(.*)(?=""))
        matches = re.findall(r'ID=\"(\d+)', text)
        return ",".join(matches)
    
    print doit(string)
    

This is meant to scrape what sits between the quotes of ID="", i.e. the ID number, but it gives an error:

    SyntaxError: invalid syntax
    

What am I doing wrong? Is there a better option?

Update:

    string = "<COREF ID="1">Salman</COREF> <COREF ID="2">Khan</COREF> (pronunciation born <COREF ID="3" REF="2">Abdul Rashid Salim Salman Khan</COREF> on 27 December 1965)[3] is an <COREF ID="14">Indian</COREF> film <COREF ID="15">actor</COREF>, <COREF ID="17">producer</COREF>, television <COREF ID="19">presenter</COREF>, and <COREF ID="20">philanthropist</COREF> known for <COREF ID="4" REF="2">his</COREF> Hindi films. <COREF ID="5" REF="2">He</COREF> is the <COREF ID="21">son</COREF> of <COREF ID="16" REF="15">actor</COREF> and screenwriter Salim <COREF ID="6" REF="2">Khan</COREF>. <COREF ID="7" REF="2">Khan</COREF> began <COREF ID="8" REF="2">his</COREF>"
    
    def doit(text):      
        import re
        #matches = re.findall((?<="ID=")(.*)(?=""))
        matches = re.findall(r'ID=\"(\d+)', text)
        return ",".join(matches)
    
    print doit(string)
    

1 answer:

Answer 0 (score: 1)

If you only want the IDs and they are all numeric, try:

re.findall(r'ID=\"(\d+)', text)

`\d+` captures digits only.
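
For the other two goals in the question (the string for each ID, and the ID/REF pairs), one possible approach is a single pattern with an optional REF group. This is only a sketch and not part of the original answer: the names `COREF_RE` and `parse_corefs` and the shortened sample `text` are made up for illustration, it is written for Python 3, and it assumes the attributes always appear as ID then REF, separated by one space. The `SyntaxError` in the question most likely comes from the unescaped double quotes inside the double-quoted `string` literal rather than from the regex, so the sample here uses single quotes on the outside.

```python
import re

# Hypothetical shortened excerpt of the question's text. Single quotes on the
# outside mean the double quotes in the markup need no escaping.
text = ('<COREF ID="1">Salman</COREF> <COREF ID="2">Khan</COREF> '
        '(born <COREF ID="3" REF="2">Abdul Rashid Salim Salman Khan</COREF>) '
        'known for <COREF ID="4" REF="2">his</COREF> Hindi films.')

# Group 1: ID number, group 2: optional REF number, group 3: enclosed text.
# Assumes the REF attribute, when present, directly follows ID.
COREF_RE = re.compile(r'<COREF ID="(\d+)"(?: REF="(\d+)")?>(.*?)</COREF>')


def parse_corefs(text):
    """Return (id_to_text, refs): a dict mapping each ID to its string,
    and a list of (id, ref) pairs for the tags that carry a REF."""
    id_to_text = {}
    refs = []
    for id_, ref, body in COREF_RE.findall(text):
        id_to_text[id_] = body
        if ref:  # findall yields '' for an optional group that did not match
            refs.append((id_, ref))
    return id_to_text, refs


id_to_text, refs = parse_corefs(text)

# Resolve each REF back to the string of the ID it points to.
for id_, ref in refs:
    print(id_, id_to_text[id_], "->", ref, id_to_text.get(ref))
```

Keeping the IDs as strings avoids a conversion step; if numeric ordering matters, they could be passed through `int()` when building the dict.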