Extracting different attributes from a string with BeautifulSoup

Time: 2014-09-19 12:13:13

Tags: python regex beautifulsoup

I have strings of this form: <COREF ID="3" REF="2"> Jacks Smith </COREF>

I am trying to extract the values of id and ref, along with the enclosed text (Jacks Smith).

import re
from bs4 import BeautifulSoup

string = '<COREF ID="1">Salman</COREF> <COREF ID="2">Khan</COREF> (pronunciation born <COREF ID="3" REF="2">Abdul Rashid Salim Salman Khan</COREF> on 27 December 1965)[3] is an <COREF ID="14">Indian</COREF> film <COREF ID="15">actor</COREF>, <COREF ID="17">producer</COREF>, television <COREF ID="19">presenter</COREF>, and <COREF ID="20">philanthropist</COREF> known for <COREF ID="4" REF="2">his</COREF> Hindi films. <COREF ID="5" REF="2">He</COREF> is the <COREF ID="21">son</COREF> of <COREF ID="16" REF="15">actor</COREF> and screenwriter Salim <COREF ID="6" REF="2">Khan</COREF>. <COREF ID="7" REF="2">Khan</COREF> began <COREF ID="8" REF="2">his</COREF> acting career with <COREF ID="22">Biwi Ho</COREF> To <COREF ID="24">Aisi</COREF> but <COREF ID="18" REF="17">it</COREF> was <COREF ID="9" REF="2">his</COREF> second film <COREF ID="25">Maine Pyar</COREF> <COREF ID="26">Kiya</COREF>(1989), in which <COREF ID="10" REF="2">he</COREF> acted in a lead role, that garnered <COREF ID="11" REF="2">him</COREF> the Filmfare Award for Best Male Debut. <COREF ID="12" REF="2">Khan</COREF> has starred in several commercially successful films, such as <COREF ID="28">Saajan</COREF> (1991), <COREF ID="29">Hum Aapke Hain Koun</COREF>..! (1994), <COREF ID="30">Karan Arjun</COREF> (1995),<COREF ID="31">Judwaa</COREF> (1997), <COREF ID="32">Pyar</COREF> <COREF ID="27" REF="26">Kiya</COREF> To Darna <COREF ID="33">Kya</COREF> (1998), <COREF ID="23" REF="22">Biwi</COREF> No.1 (1999), and Hum Saath <COREF ID="34">Saath Hain</COREF> (1999), having appeared in the highest grossing film nine separate years during <COREF ID="13" REF="2">his</COREF> career, a record that remains unbroken.[4]'
soup = BeautifulSoup(string)
#print soup

a = []

for entry in soup.find_all('coref'):
    print entry.name.string  # Issue: this prints "coref" rather than the text, e.g. "Salman" for the first id
    print entry['id']
    #a[entry['id']] = entry.name.string  # here I want an array keyed by ID that holds the corresponding string

for entry in soup.find_all('coref'):
  if entry.has_key('ref'):
    print entry
    print entry['ref']
    #print a[entry['ref']]    # here I want to print the referenced entity from array a

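Problems:

  1. print entry.name.string does not give the text inside the tag.
  2. What is the correct way to store the id as an array key and assign the string as its value?
  3. A warning is raised: /usr/local/lib/python2.7/dist-packages/bs4/element.py:1413: UserWarning: has_key is deprecated. Use has_attr("ref") instead. key))

Any help would be appreciated.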

1 answer:

Answer 0 (score: 0)

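  1. entry.name is the name of the tag, not its contents. Use entry.get_text() to get the text inside the tag.
  2. You cannot insert into a list at an arbitrary position. Declare a as a dictionary instead (a = {}).
  3. As the warning says, entry.has_key is deprecated and may be removed in a later version. Replace it with something like if entry.has_attr('ref'): ...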
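Put together, the three points above might look like the sketch below. It is only an illustration under a few assumptions that are not in the original post: it uses Python 3 print syntax, passes 'html.parser' explicitly to BeautifulSoup, and runs on a shortened sample string. The names text, mentions and references are made up for this example.

```python
from bs4 import BeautifulSoup

# Shortened sample in the same format as the string in the question (assumption).
text = ('<COREF ID="1">Salman</COREF> <COREF ID="2">Khan</COREF> is an actor. '
        '<COREF ID="5" REF="2">He</COREF> is the son of Salim '
        '<COREF ID="6" REF="2">Khan</COREF>.')

# html.parser lowercases tag and attribute names, so we look up 'coref', 'id', 'ref'.
soup = BeautifulSoup(text, 'html.parser')

# Point 1 and 2: use get_text() for the tag contents and a dict keyed by ID.
mentions = {}
for entry in soup.find_all('coref'):
    mentions[entry['id']] = entry.get_text()

# Point 3: has_attr() replaces the deprecated has_key().
references = []
for entry in soup.find_all('coref'):
    if entry.has_attr('ref'):
        # Look up the text of the entity this mention refers to.
        references.append((entry['id'], entry.get_text(), mentions[entry['ref']]))

for ref_id, mention, target in references:
    print(ref_id, mention, '->', target)
```

Note that mentions[entry['ref']] raises a KeyError if a REF points at an ID that does not occur in the string; mentions.get(entry['ref']) would return None in that case instead.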