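I have a .txt file whose format resembles XML, but the website I retrieved it from warns that it is not valid XML. With some parsing, I managed to extract the information in small chunks like the ones below, using infoTable as the reference point.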
<infoTable>
<nameOfIssuer>COMPANYONE</nameOfIssuer>
<titleOfClass>SHS CLASS -A -</titleOfClass>
<cusip>00000</cusip>
<value>21944</value>
<shrsOrPrnAmt>
<sshPrnamt>3060500</sshPrnamt>
<sshPrnamtType>SH</sshPrnamtType>
</shrsOrPrnAmt>
<investmentDiscretion>SOLE</investmentDiscretion>
<votingAuthority>
<Sole>3060500</Sole>
<Shared>0</Shared>
<None>0</None>
</votingAuthority>
</infoTable>
<infoTable>
<nameOfIssuer>COMPANYTWO</nameOfIssuer>
<titleOfClass>COM</titleOfClass>
<cusip>00001</cusip>
<value>67822</value>
<shrsOrPrnAmt>
<sshPrnamt>1898717</sshPrnamt>
<sshPrnamtType>SH</sshPrnamtType>
</shrsOrPrnAmt>
<investmentDiscretion>SOLE</investmentDiscretion>
<votingAuthority>
<Sole>1898717</Sole>
<Shared>0</Shared>
<None>0</None>
</votingAuthority>
</infoTable>
<infoTable>
<nameOfIssuer>COMPANYTHREE</nameOfIssuer>
<titleOfClass>CL B NEW</titleOfClass>
<cusip>00002</cusip>
<value>10462145</value>
<shrsOrPrnAmt>
<sshPrnamt>52078974</sshPrnamt>
<sshPrnamtType>SH</sshPrnamtType>
</shrsOrPrnAmt>
<investmentDiscretion>SOLE</investmentDiscretion>
<votingAuthority>
<Sole>52078974</Sole>
<Shared>0</Shared>
<None>0</None>
</votingAuthority>
</infoTable>
My problem is that I don't know how to properly extract the values from the tags. I have tried something like this:
soup = BeautifulSoup("myData")
soup = find_all("nameOfIssuer")[0].readContent()
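However, this gives me an out-of-range error. A further problem is that, although this .txt does not show it, the data I am pulling from is missing some columns, which should be filled in as NaN. So ideally I am trying to get the data into tsv format: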
NameofIssuer  TitleofClass    cusip  value  shrsPrnamt  shrsPrnamtType  putcall  investmentDescrestion  othermanager  vaSole   vaShared  vaNone
COMPANYONE    SHS CLASS -A -  00000  21944  3060500     SH              NaN      SOLE                   NaN           3060500  0         0
COMPANYTWO    COM             00001  67822  1898717     SH              NaN      SOLE                   NaN           1898717  0         0
Edit: At @RomanPerekhrest's suggestion, I have included an additional XML file that shows the otherManager and putCall tags:
<ns1:infoTable>
<ns1:nameOfIssuer>COMPANYFOUR</ns1:nameOfIssuer>
<ns1:titleOfClass>COM</ns1:titleOfClass>
<ns1:cusip>00004</ns1:cusip>
<ns1:value>67</ns1:value>
<ns1:shrsOrPrnAmt>
<ns1:sshPrnamt>36100</ns1:sshPrnamt>
<ns1:sshPrnamtType>SH</ns1:sshPrnamtType>
</ns1:shrsOrPrnAmt>
<ns1:putCall>Call</ns1:putCall>
<ns1:investmentDiscretion>DFND</ns1:investmentDiscretion>
<ns1:otherManager>01, 02</ns1:otherManager>
<ns1:votingAuthority>
<ns1:Sole>36100</ns1:Sole>
<ns1:Shared>0</ns1:Shared>
<ns1:None>0</ns1:None>
</ns1:votingAuthority>
</ns1:infoTable>
<ns1:infoTable>
<ns1:nameOfIssuer>COMPANYFIVE</ns1:nameOfIssuer>
<ns1:titleOfClass>SPONSORED ADS A</ns1:titleOfClass>
<ns1:cusip>00005</ns1:cusip>
<ns1:value>2695</ns1:value>
<ns1:shrsOrPrnAmt>
<ns1:sshPrnamt>339367</ns1:sshPrnamt>
<ns1:sshPrnamtType>SH</ns1:sshPrnamtType>
</ns1:shrsOrPrnAmt>
<ns1:investmentDiscretion>DFND</ns1:investmentDiscretion>
<ns1:otherManager>01, 02</ns1:otherManager>
<ns1:votingAuthority>
<ns1:Sole>339367</ns1:Sole>
<ns1:Shared>0</ns1:Shared>
<ns1:None>0</ns1:None>
</ns1:votingAuthority>
</ns1:infoTable>
Answer 0 (score: 1)
With the variable data holding the concatenated strings from the question (the full text is too long to paste here):
import csv
from bs4 import BeautifulSoup

# The lxml HTML parser lowercases tag names, hence the .lower() lookups below.
soup = BeautifulSoup(data, 'lxml')

cols = ['nameOfIssuer', 'titleOfClass', 'cusip', 'value', 'sshPrnamt', 'sshPrnamtType', 'putCall', 'investmentDiscretion', 'otherManager', 'Sole', 'Shared', 'None']

rows = []
for info_table in soup.find_all(['ns1:infotable', 'infotable']):
    row = []
    for col in cols:
        d = info_table.find([col.lower(), 'ns1:' + col.lower()])
        # Fill in 'NaN' when the tag is missing from this block.
        row.append(d.text.strip() if d else 'NaN')
    rows.append(row)

headers = ['NameofIssuer', 'TitleofClass', 'cusip', 'value', 'shrsPrnamt', 'shrsPrnamtType', 'putcall', 'investmentDescrestion', 'othermanager', 'vaSole', 'vaShared', 'vaNone']

with open('data.csv', 'w', newline='') as csvfile:
    csvwriter = csv.writer(csvfile, delimiter=',',
                           quotechar='"', quoting=csv.QUOTE_MINIMAL)
    csvwriter.writerow(headers)
    csvwriter.writerows(rows)
The script writes its output to data.csv, which can then be opened in LibreOffice.
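The question asks for tab-separated output, while the script above writes comma-separated values. As a minimal sketch, assuming rows shaped like the ones the script builds, the same csv.writer can emit TSV by changing the delimiter. The sample headers and rows and the helper name rows_to_tsv are made up for illustration, and the text is written to an in-memory buffer rather than a file so the result is easy to inspect.

```python
import csv
import io

# Hypothetical sample rows, shaped like the rows built by the script above;
# the string 'NaN' marks a tag that was absent from an infoTable block.
headers = ['NameofIssuer', 'putcall', 'vaSole']
rows = [
    ['COMPANYFOUR', 'Call', '36100'],
    ['COMPANYFIVE', 'NaN', '339367'],
]


def rows_to_tsv(headers, rows):
    """Serialize a header row and data rows as tab-separated text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter='\t', lineterminator='\n')
    writer.writerow(headers)
    writer.writerows(rows)
    return buffer.getvalue()


tsv_text = rows_to_tsv(headers, rows)
print(tsv_text)
```

To write to disk instead, the same writer can be pointed at a file opened with newline='' as in the script above.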
Answer 1 (score: 0)
Extended solution using the lxml.etree, OrderedDict and pandas libraries:
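We first need to fix the malformed XML content: the main idea is to add a root tag that declares the XML namespace (ns1). For demonstration purposes, the input XML (taken directly from the question) is read in as a string and then modified further.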
from collections import OrderedDict

from lxml import etree
import pandas as pd

ns = {'ns1': 'http://base.google.com/ns/1.0'}
ns_prefix = '{{{}}}'.format(ns['ns1'])  # '{uri}' as it appears in el.tag

# Wrap the fragments in a root element that declares the ns1 namespace.
with open('base.xml') as f:
    xml_content = '<root xmlns:ns1="{}">{}</root>'.format(ns['ns1'], f.read())
doc = etree.fromstring(xml_content)

records = []
for block in doc.findall('ns1:infoTable', namespaces=ns):
    d = OrderedDict()
    for el in block:
        el_tag = el.tag.replace(ns_prefix, '')
        inner_childs = list(el)
        if inner_childs:  # element has child nodes
            prefix = 'va' if el_tag == 'votingAuthority' else ''
            d.update({prefix + child.tag.replace(ns_prefix, ''): child.text
                      for child in inner_childs})
        else:
            d[el_tag] = el.text
    records.append(d)

df = pd.DataFrame(records)
print(df.to_string(index=False))
Output:
nameOfIssuer  titleOfClass     cusip  value  sshPrnamt  sshPrnamtType  putCall  investmentDiscretion  otherManager  vaSole  vaShared  vaNone
COMPANYFOUR   COM              00004  67     36100      SH             Call     DFND                  01, 02        36100   0         0
COMPANYFIVE   SPONSORED ADS A  00005  2695   339367     SH             NaN      DFND                  01, 02        339367  0         0
To save the result to a csv file with the desired delimiter, use the df.to_csv() routine:
df.to_csv(path_or_buf='output.csv', sep='\t', index=False)
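The wrapping idea from this answer can also be sketched with the standard library alone, which may help when lxml is not available. This is only an illustration under stated assumptions: the fragment below is a shortened, made-up mix of both tag styles from the question, the namespace URI is the same placeholder used above, and the helper names local_name and parse_info_tables are hypothetical. It is not the answer's own method.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI; the ns1 prefix only has to be bound to some URI
# for the parser to accept the ns1:-prefixed tags.
NS_URI = 'http://base.google.com/ns/1.0'

# Shortened, hypothetical fragment mixing both tag styles from the question.
fragment = """
<infoTable>
  <nameOfIssuer>COMPANYONE</nameOfIssuer>
  <votingAuthority><Sole>3060500</Sole></votingAuthority>
</infoTable>
<ns1:infoTable>
  <ns1:nameOfIssuer>COMPANYFOUR</ns1:nameOfIssuer>
  <ns1:putCall>Call</ns1:putCall>
  <ns1:votingAuthority><ns1:Sole>36100</ns1:Sole></ns1:votingAuthority>
</ns1:infoTable>
"""


def local_name(tag):
    """Drop a leading '{namespace-uri}' from an ElementTree tag name."""
    return tag.rsplit('}', 1)[-1]


def parse_info_tables(text):
    """Wrap the fragments in one root element and flatten each infoTable."""
    root = ET.fromstring('<root xmlns:ns1="{}">{}</root>'.format(NS_URI, text))
    records = []
    for block in root:
        if local_name(block.tag) != 'infoTable':
            continue
        record = {}
        for el in block.iter():
            # Keep leaf elements only; containers hold no text of their own.
            if el is not block and len(el) == 0:
                record[local_name(el.tag)] = (el.text or '').strip()
        records.append(record)
    return records


records = parse_info_tables(fragment)
columns = ['nameOfIssuer', 'putCall', 'Sole']
# Tags missing from a block are filled with the string 'NaN'.
table = [[rec.get(col, 'NaN') for col in columns] for rec in records]
print(table)
```

Because the namespace is stripped from every tag name, prefixed and unprefixed blocks end up in the same columns. Unlike the lxml version above, this sketch does not add the va prefix to the votingAuthority children.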