Extracting data from infoTable elements while filling in missing values

Asked: 2019-07-17 18:38:55

Tags: python beautifulsoup

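I have a .txt file in an XML-like format, but the site I retrieved it from warns me that it is not valid XML. With some parsing, I managed to pull the information out in small chunks like these, using infoTable as the reference.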

  <infoTable>

    <nameOfIssuer>COMPANYONE</nameOfIssuer>

    <titleOfClass>SHS CLASS -A -</titleOfClass>

    <cusip>00000</cusip>

    <value>21944</value>

    <shrsOrPrnAmt>

      <sshPrnamt>3060500</sshPrnamt>

      <sshPrnamtType>SH</sshPrnamtType>

    </shrsOrPrnAmt>

    <investmentDiscretion>SOLE</investmentDiscretion>

    <votingAuthority>

      <Sole>3060500</Sole>

      <Shared>0</Shared>

      <None>0</None>

    </votingAuthority>

  </infoTable>

  <infoTable>

    <nameOfIssuer>COMPANYTWO</nameOfIssuer>

    <titleOfClass>COM</titleOfClass>

    <cusip>00001</cusip>

    <value>67822</value>

    <shrsOrPrnAmt>

      <sshPrnamt>1898717</sshPrnamt>

      <sshPrnamtType>SH</sshPrnamtType>

    </shrsOrPrnAmt>

    <investmentDiscretion>SOLE</investmentDiscretion>

    <votingAuthority>

      <Sole>1898717</Sole>

      <Shared>0</Shared>

      <None>0</None>

    </votingAuthority>

  </infoTable>

  <infoTable>

    <nameOfIssuer>COMPANYTHREE</nameOfIssuer>

    <titleOfClass>CL B NEW</titleOfClass>

    <cusip>00002</cusip>

    <value>10462145</value>

    <shrsOrPrnAmt>

      <sshPrnamt>52078974</sshPrnamt>

      <sshPrnamtType>SH</sshPrnamtType>

    </shrsOrPrnAmt>

    <investmentDiscretion>SOLE</investmentDiscretion>

    <votingAuthority>

      <Sole>52078974</Sole>

      <Shared>0</Shared>

      <None>0</None>

    </votingAuthority>

  </infoTable>

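My problem is that I don't know how to extract the values from the tags correctly. I tried something like this:

soup = BeautifulSoup("myData")
soup = find_all("nameOfIssuer")[0].readContent()

However, this gives me an index-out-of-range error. Another problem is that, although this .txt doesn't show it, the data I'm pulling from has missing columns that should be filled in as NaN. Ideally, I'm trying to get the data in TSV format: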

NameofIssuer TitleofClass   cusip value   shrsPrnamt  shrsPrnamtType  putcall  investmentDescrestion  othermanager   vaSole  vaShared   vaNone
COMPANYONE   SHS CLASS -A - 00000 21944   3060500     SH              NaN      SOLE                   NaN            3060500 0          0
COMPANYTWO   COM            00001 67822   1898717     SH              NaN      SOLE                   NaN            1898717 0          0

Edit: At @RomanPerekhrest's suggestion, I've included an additional XML file that shows the otherManager and putCall tags:

    <ns1:infoTable>
        <ns1:nameOfIssuer>COMPANYFOUR</ns1:nameOfIssuer>
        <ns1:titleOfClass>COM</ns1:titleOfClass>
        <ns1:cusip>00004</ns1:cusip>
        <ns1:value>67</ns1:value>
        <ns1:shrsOrPrnAmt>
            <ns1:sshPrnamt>36100</ns1:sshPrnamt>
            <ns1:sshPrnamtType>SH</ns1:sshPrnamtType>
        </ns1:shrsOrPrnAmt>
        <ns1:putCall>Call</ns1:putCall>
        <ns1:investmentDiscretion>DFND</ns1:investmentDiscretion>
        <ns1:otherManager>01, 02</ns1:otherManager>
        <ns1:votingAuthority>
            <ns1:Sole>36100</ns1:Sole>
            <ns1:Shared>0</ns1:Shared>
            <ns1:None>0</ns1:None>
        </ns1:votingAuthority>
    </ns1:infoTable>
    <ns1:infoTable>
        <ns1:nameOfIssuer>COMPANYFIVE</ns1:nameOfIssuer>
        <ns1:titleOfClass>SPONSORED ADS A</ns1:titleOfClass>
        <ns1:cusip>00005</ns1:cusip>
        <ns1:value>2695</ns1:value>
        <ns1:shrsOrPrnAmt>
            <ns1:sshPrnamt>339367</ns1:sshPrnamt>
            <ns1:sshPrnamtType>SH</ns1:sshPrnamtType>
        </ns1:shrsOrPrnAmt>
        <ns1:investmentDiscretion>DFND</ns1:investmentDiscretion>
        <ns1:otherManager>01, 02</ns1:otherManager>
        <ns1:votingAuthority>
            <ns1:Sole>339367</ns1:Sole>
            <ns1:Shared>0</ns1:Shared>
            <ns1:None>0</ns1:None>
        </ns1:votingAuthority>
    </ns1:infoTable>

2 Answers:

Answer 0 (score: 1)

The variable data holds the concatenated strings from the question (link; it was too long to paste here):

import csv
from bs4 import BeautifulSoup

# `data` is the concatenated text from the question.
# The 'lxml' HTML parser lowercases tag names, hence the .lower() calls below.
soup = BeautifulSoup(data, 'lxml')

cols = ['nameOfIssuer', 'titleOfClass', 'cusip', 'value', 'sshPrnamt', 'sshPrnamtType',
        'putCall', 'investmentDiscretion', 'otherManager', 'Sole', 'Shared', 'None']

rows = []
for info_table in soup.find_all(['ns1:infotable', 'infotable']):
    row = []
    for col in cols:
        # Look for the tag both with and without the ns1: prefix
        d = info_table.find([col.lower(), 'ns1:' + col.lower()])
        # A missing tag becomes the literal string 'NaN'
        row.append(d.text.strip() if d else 'NaN')
    rows.append(row)

headers = ['NameofIssuer', 'TitleofClass', 'cusip', 'value', 'shrsPrnamt', 'shrsPrnamtType',
           'putcall', 'investmentDescrestion', 'othermanager', 'vaSole', 'vaShared', 'vaNone']

with open('data.csv', 'w', newline='') as csvfile:
    csvwriter = csv.writer(csvfile, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    csvwriter.writerow(headers)
    csvwriter.writerows(rows)

The script writes data.csv. Opened in LibreOffice, it looks like this:

[Image: data.csv opened in LibreOffice]
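
The question asked for tab-separated output with NaN in the gaps, while the script above writes commas. As a hedged, standard-library-only sketch of that variation: csv.DictWriter takes a restval argument that is written for any field a row lacks, and delimiter='\t' switches the separator to tabs. The sample rows and the shortened field list below are illustrative assumptions, not data from the linked file.

```python
import csv
import io

# Illustrative field list (a subset of the columns in the question).
fields = ["nameOfIssuer", "cusip", "putCall", "otherManager"]

# Hypothetical parsed rows: the first has no putCall/otherManager keys.
rows = [
    {"nameOfIssuer": "COMPANYONE", "cusip": "00000"},
    {"nameOfIssuer": "COMPANYFOUR", "cusip": "00004",
     "putCall": "Call", "otherManager": "01, 02"},
]

buf = io.StringIO()  # stands in for open('data.tsv', 'w', newline='')
writer = csv.DictWriter(buf, fieldnames=fields, restval="NaN",
                        delimiter="\t", lineterminator="\n")
writer.writeheader()       # header row built from `fields`
writer.writerows(rows)     # keys missing from a row are written as restval
tsv_text = buf.getvalue()
print(tsv_text)
```

With dict rows, the per-column 'NaN' fallback moves out of the parsing loop and into the writer, so the parsing code only has to record the tags it actually finds.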

Answer 1 (score: 0)

An extended solution using lxml.etree, OrderedDict, and the pandas library:

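We first need to fix the malformed XML content: the main idea is to add a root tag that declares the ns1 XML namespace. For demonstration purposes, the input XML (taken directly from the question) is read in as a string and then modified further.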

from collections import OrderedDict

from lxml import etree
import pandas as pd

# Wrap the fragments in a root element that declares the ns1 prefix
with open('base.xml') as f:
    xml_content = '<root xmlns:ns1="http://base.google.com/ns/1.0">{}</root>'.format(f.read())
doc = etree.fromstring(xml_content)
ns = {'ns1': 'http://base.google.com/ns/1.0'}
records = []

for block in doc.findall('ns1:infoTable', namespaces=ns):
    d = OrderedDict()
    for el in block:    # iterate over direct children (getchildren() is deprecated)
        el_tag = el.tag.replace("{{{}}}".format(ns['ns1']), '')
        inner_childs = list(el)
        if inner_childs:    # if element has child nodes
            prefix = 'va' if el_tag == 'votingAuthority' else ''
            d.update({prefix + child.tag.replace("{{{}}}".format(ns['ns1']), ''): child.text
                      for child in inner_childs})
        else:
            d[el_tag] = el.text
    records.append(d)

df = pd.DataFrame(records)
print(df.to_string(index=False, justify='right'))

Output:

nameOfIssuer     titleOfClass  cusip value sshPrnamt sshPrnamtType putCall investmentDiscretion otherManager  vaSole vaShared vaNone
 COMPANYFOUR              COM  00004    67     36100            SH    Call                 DFND       01, 02   36100        0      0
 COMPANYFIVE  SPONSORED ADS A  00005  2695    339367            SH     NaN                 DFND       01, 02  339367        0      0
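
The loop above matches only ns1-prefixed infoTable elements, so the un-prefixed blocks from the first sample in the question would be skipped. Here is a minimal sketch of one way to accept both, using only the standard library's xml.etree.ElementTree instead of lxml: wrap the fragments in a root that declares ns1, then compare tag names with the namespace part stripped. The local_name helper, the trimmed inline sample, and the reuse of the placeholder namespace URI are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

NS_URI = "http://base.google.com/ns/1.0"  # placeholder URI, as in the answer

# Trimmed-down sample: one un-prefixed block and one ns1-prefixed block.
fragments = """
<infoTable>
  <nameOfIssuer>COMPANYONE</nameOfIssuer>
  <votingAuthority><Sole>3060500</Sole><Shared>0</Shared><None>0</None></votingAuthority>
</infoTable>
<ns1:infoTable>
  <ns1:nameOfIssuer>COMPANYFOUR</ns1:nameOfIssuer>
  <ns1:putCall>Call</ns1:putCall>
  <ns1:votingAuthority><ns1:Sole>36100</ns1:Sole><ns1:Shared>0</ns1:Shared><ns1:None>0</ns1:None></ns1:votingAuthority>
</ns1:infoTable>
"""


def local_name(tag):
    """Drop a leading '{namespace}' from an ElementTree tag (hypothetical helper)."""
    return tag.rsplit("}", 1)[-1]


# The root element declares ns1 so that the prefixed fragments parse.
root = ET.fromstring('<root xmlns:ns1="{}">{}</root>'.format(NS_URI, fragments))

records = []
for block in root:
    if local_name(block.tag) != "infoTable":
        continue
    record = {}
    for el in block:
        name = local_name(el.tag)
        children = list(el)
        if children:
            # Flatten nested tags; votingAuthority children get a 'va' prefix
            prefix = "va" if name == "votingAuthority" else ""
            for child in children:
                record[prefix + local_name(child.tag)] = child.text
        else:
            record[name] = el.text
    records.append(record)

print(records)
```

Records built this way can be passed to pd.DataFrame(records) just like the ones above; keys that a block lacks, such as putCall for COMPANYONE, end up as NaN in the resulting frame.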

To save the result to a CSV file with the desired separator, use the df.to_csv() routine:

df.to_csv(path_or_buf='output.csv', sep='\t', index=False)
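
One detail worth checking against the question's desired output: by default to_csv() writes missing values as empty strings, so the file would contain blanks rather than the text NaN. A small sketch of the na_rep parameter, which controls that representation; the two records and the in-memory buffer are hypothetical stand-ins for the real data and output file.

```python
import io

import pandas as pd

# Hypothetical records: the first lacks putCall, so pandas stores NaN there.
records = [
    {"nameOfIssuer": "COMPANYFIVE", "cusip": "00005"},
    {"nameOfIssuer": "COMPANYFOUR", "cusip": "00004", "putCall": "Call"},
]
df = pd.DataFrame(records)

buf = io.StringIO()  # stands in for 'output.csv'
# na_rep sets the text written for missing values (the default is '')
df.to_csv(buf, sep="\t", index=False, na_rep="NaN")
tsv_text = buf.getvalue()
print(tsv_text)
```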