I have some code that simply prints:
Name
Workplace
And a abstract
and then repeats this over and over. So:
NameA
WorkplaceA
And a abstractA
NameB
WorkplaceB
And a abstractB
etc...
I need to split this into three columns:
NameCol WorkplaceCol AbstractCol
NameA WorkplaceA AbstractA
NameB WorkplaceB AbstractB
NameC WorkplaceC AbstractC
etc...
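When my code finds an <h1> tag, it loops back to the start. I do not display this tag, however. So one record is the name, the workplace, and the abstract, up until a new <h1> tag is reached.
Here is my code: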
headernum = 0
i = 0
x = soup.find_all("h1")
for i in range(len(x)):
    header = soup.find_all('h1')[headernum]
    name = header.find_all_next('p')[1]
    print(name.text)
    workplace = name.find_all_next('i')[0]
    print(workplace.text)
    abstract = []
    for elem in name.next_siblings:
        if elem.name == 'h1':
            break
        if elem.name != 'p':
            continue
        abstract.append(elem.get_text())
    x = " ".join(abstract).replace("\n", " ").encode('utf-8')
    print(x)
    i += 1
    headernum += 1
I am struggling to split this up and put it into columns.
Answer 0: (score: 0)
Suppose you have a df like this:
col1
NameA
WorkplaceA
AbstractA
NameB
WorkplaceB
AbstractB
You can do:
import numpy as np
# Set the same number for each 3 lines
df['index'] = df.index / 3
df['index'] = df['index'].apply(np.floor)
# Set 0 for Names, 1 for Workplaces and 2 for Abstract
df["type_id"] = df.index % 3
# Rename 0, 1 and 2 by a label
df["type_label"] = df["type_id"].map({0: "Name", 1: "Workplace", 2: "Abstract"})
# Pivot the table
df = df.pivot(index='index', columns='type_label', values='col1')
print(df)
This gives you:
type_label Abstract Name Workplace
index
0.0 AbstractA NameA WorkplaceA
1.0 AbstractB NameB WorkplaceB
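As a possible shortcut for the same single-column layout, the values can also be reshaped directly into rows of three. This is only a sketch: the sample data is made up to match the df above, and it assumes the column length is an exact multiple of 3 (otherwise reshape raises a ValueError).

```python
import pandas as pd

# Hypothetical sample data in the same single-column layout as above.
df = pd.DataFrame({"col1": ["NameA", "WorkplaceA", "AbstractA",
                            "NameB", "WorkplaceB", "AbstractB"]})

# Reshape the flat column into rows of three values each.
# reshape(-1, 3) raises ValueError if the length is not a multiple of 3.
values = df["col1"].to_numpy().reshape(-1, 3)

# Name the columns in the order the values appear in each record.
wide = pd.DataFrame(values, columns=["Name", "Workplace", "Abstract"])
print(wide)
```

Unlike the pivot, this keeps the columns in the Name, Workplace, Abstract order and leaves a plain integer index.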
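Answer 1: (score: 0)
If you want to process your own input format, you need to make some assumptions. For this code example, I assume that "<h1>" appears only between the three-line records. If it can appear inside a record, the code needs to be slightly different.
The idea:
Write a generator function that loops over the text and yields each complete record as a dictionary.
Collect them all.
Since you tagged the question "pandas", move the result into a pandas DataFrame.
Here is a working example.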
import pandas as pd

example_text = """NameA
WorkplaceA
And a abstractA
NameB
WorkplaceB
And a abstractB
<h1>
NameC
WorkplaceC
And a abstractC"""

def next_name(mystr):
    lines = iter(mystr.split('\n'))
    while True:
        n = {'NameCol': None,
             'WorkplaceCol': None,
             'AbstractCol': None
             }
        try:
            n['NameCol'] = next(lines)
            if n['NameCol'] == '<h1>':
                continue
            n['WorkplaceCol'] = next(lines)
            if n['WorkplaceCol'] == '<h1>':
                continue
            n['AbstractCol'] = next(lines)
            if n['AbstractCol'] == '<h1>':
                continue
            yield n
        except StopIteration:
            break

df = pd.DataFrame(next_name(example_text), columns=['NameCol', 'WorkplaceCol', 'AbstractCol'])
print(df)
The DataFrame prints as:
NameCol WorkplaceCol AbstractCol
0 NameA WorkplaceA And a abstractA
1 NameB WorkplaceB And a abstractB
2 NameC WorkplaceC And a abstractC
If you need to print the DataFrame exactly as in your example, here is some sample code.
print(''.join(f'{x}\t' for x in df.columns))
print()
for row in df.iterrows():
    print(''.join(f'{x}\t' for x in row[1]))
Output:
NameCol WorkplaceCol AbstractCol
NameA WorkplaceA And a abstractA
NameB WorkplaceB And a abstractB
NameC WorkplaceC And a abstractC
Note: I am using Python 3.6. On an older version you will need to change the print calls, because they use f-strings.
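If tab-separated text is all that is needed, pandas can probably produce it without a manual loop. A minimal sketch, using a made-up two-row frame with the same column names as above; note that the result differs slightly from the loop version (no trailing tab on each line and no blank line after the header).

```python
import pandas as pd

# Hypothetical frame with the same columns as the example above.
df = pd.DataFrame(
    [["NameA", "WorkplaceA", "And a abstractA"],
     ["NameB", "WorkplaceB", "And a abstractB"]],
    columns=["NameCol", "WorkplaceCol", "AbstractCol"],
)

# With no path argument, to_csv returns the text instead of writing a file.
# index=False leaves out the 0, 1, 2 ... row labels.
text = df.to_csv(sep="\t", index=False)
print(text)
```

Passing a file name as the first argument to to_csv would write the same text to disk instead.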
For comparison, this can also be done with pandas alone (using example_text from the code above):
df = pd.DataFrame(example_text.split('\n'))
df = df[df[0] != '<h1>'].reset_index().copy()
df['row'] = df.index // 3
result = df.groupby('row').agg(lambda x: list(x))[0].values
print('\t'.join(["NameCol", "WorkplaceCol", "AbstractCol"]))
print('')
print('\n'.join(['\t'.join(x) for x in result]))
The output is the same:
NameCol WorkplaceCol AbstractCol
NameA WorkplaceA And a abstractA
NameB WorkplaceB And a abstractB
NameC WorkplaceC And a abstractC
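For completeness, the same filter-then-group idea can be sketched in plain Python without pandas. The function name records and its marker and size parameters are my own, not from the question. It assumes every "<h1>" line can simply be discarded wherever it appears, and that an incomplete trailing group should be dropped.

```python
def records(text, marker="<h1>", size=3):
    """Yield tuples of `size` consecutive lines, ignoring marker lines."""
    lines = [line for line in text.split("\n") if line != marker]
    # Step through the filtered lines one record at a time; an incomplete
    # trailing group is dropped rather than padded.
    for start in range(0, len(lines) - size + 1, size):
        yield tuple(lines[start:start + size])


example_text = """NameA
WorkplaceA
And a abstractA
NameB
WorkplaceB
And a abstractB
<h1>
NameC
WorkplaceC
And a abstractC"""

print("\t".join(["NameCol", "WorkplaceCol", "AbstractCol"]))
for row in records(example_text):
    print("\t".join(row))
```

The tuples can also be passed to pd.DataFrame together with a columns list if a DataFrame is wanted afterwards.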