How do I convert this text into a pandas DataFrame?

Time: 2020-05-09 21:57:58

Tags: python pandas file dataframe

I have a file that is not in CSV format. Its content looks like this.

File:

"TITULO: Albedo SUBTITULO Y PARRAFO: ===Trees===
Because forests generally have a low albedo, (the majority of the ultraviolet and [[visible 
`spectrum]] is absorbed through [[photosynthesis]])
"

"TITULO: Albedo SUBTITULO Y PARRAFO: ===Human activities===
Human activities (e.g., deforestation, farming, and urbanization) change the albedo of various areas 
around 
"TITULO: Abraham Lincoln SUBTITULO Y PARRAFO: ==U.S. House of Representatives, 1847–1849==
[[File:Abraham Lincoln by Nicholas Shepherd, 1846-crop.jpg|thumb|upright|alt=Middle 
True to his record, Lincoln professed to friends in 1861 to be ""an old line Whig,
"TITULO: Abraham Lincoln SUBTITULO Y PARRAFO: ===Re-election===
{{Main|1864 United States presidential election}}
[[File:ElectoralCollege1864.svg|thumb|upright=1.3|alt=Map of the 
"TITULO: Algeria SUBTITULO Y PARRAFO: ===Research and alternative energy sources===
Algeria has invested an estimated 100 billion dinars towards developing research facilities and 
paying researchers. 
Ecological anthropology is defined as the ""study of [[cultural adaptation]]s to environments""
"TITULO: Agricultural science SUBTITULO Y PARRAFO: ==Fields or related disciplines==
{{Col-begin}}
{{Col-break}}
* [[Agricultural biotechnology]]
* [[Agricultural chemistry]]
* [[Agricultural diversification]]
* [[Agricultural education]]
* [[Agricultural economics]]
* [[Agricultural engineering]]

I have this program:

import pandas as pd

data = pd.read_csv('datos_titulos.csv', header = None)
print(data)

I get this error:

ParserError: Error tokenizing data. C error: Expected 1 fields in line 6, saw 3

The DataFrame should look like this:

Title                  Head                          TXT
Albedo                 Trees                         Because forests generally have a low  ...([[photosynthesis]])
Albedo                 Human activities              Human activities (e.g., de...areas around 
Abraham Lincoln        U.S. House of..1849           [[File:Abraham Lincoln by... line Whig,
.
.
.
Agricultural science  Fields or related disciplines  {{Col-begin}} {{Col-break}}...* [[Agricultural engineering]]

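That is, the Title column holds the TITULO value, Head holds the text between the == markers after SUBTITULO Y PARRAFO, and TXT holds the text that follows, up to the next TITULO.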

1 answer:

Answer 0 (score: 0)

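IIUC, you can use two regular expressions from the re module. First, we iterate over the text file line by line to get the title and head fields.

Second, we use re.split to collect the text field. This relies on the assumption that, although your data is messy text, it still keeps the order title > head > text.

You will need to clean up the text column further, but that is the fun part :)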

import re
from collections import defaultdict
import pandas as pd

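pandas_dict = defaultdict(list)

# Only headings wrapped in "===" are matched; "==...==" headings are skipped.
pat = re.compile(r"TITULO: (.*) SUBTITULO Y PARRAFO: ===(.*?)===")

# First pass: collect the title and head of every matching header line.
with open("file.txt", "r") as f:
    for line in f:
        m = pat.search(line)
        if m:
            pandas_dict["title"].append(m.group(1))
            pandas_dict["head"].append(m.group(2))

# Second pass: read the whole file and split it on "===".
with open("file.txt", "r") as f:
    body = f.read()

# After the split, odd indices hold the headings and even indices
# (from 2 onwards) hold the text that follows each heading.
for chunk in re.split(r"===", body.strip())[2::2]:
    pandas_dict["text"].append(chunk.strip())

df = pd.DataFrame(pandas_dict)

print(df)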

             title                                     head                                               text
0           Albedo                                    Trees  Because forests generally have a low albedo, (...
1           Albedo                         Human activities  Human activities (e.g., deforestation, farming...
2  Abraham Lincoln                              Re-election  {{Main|1864 United States presidential electio...
3          Algeria  Research and alternative energy sources  Algeria has invested an estimated 100 billion ...
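
The further cleaning mentioned above might look like the following. This is only a minimal sketch. It assumes the leftovers in each text chunk are the start of the next header line (a side effect of splitting on "==="), lines holding a lone quote character, CSV-style doubled quotes, and line breaks that should become spaces, as in the desired output in the question. `clean_text` is a hypothetical helper name, not part of pandas.

```python
import re

# Start of the next header line, left at the very end of a chunk
# by the split on "===". "." does not cross newlines here, so only
# the last line of the chunk can be removed.
TRAILING_HEADER = re.compile(r'"?TITULO: .*? SUBTITULO Y PARRAFO:\s*\Z')

# A line that contains nothing but a single quote character.
LONE_QUOTE = re.compile(r'(?m)^[ \t]*"[ \t]*$')


def clean_text(text: str) -> str:
    """Hypothetical helper: tidy one raw text chunk."""
    text = TRAILING_HEADER.sub("", text)
    text = LONE_QUOTE.sub("", text)
    text = text.replace('""', '"')  # undo doubled quotes
    return " ".join(text.split())   # collapse line breaks and extra spaces


raw = 'Because forests ([[photosynthesis]])\n"\n\n"TITULO: Albedo SUBTITULO Y PARRAFO: '
print(clean_text(raw))
```

With the DataFrame built above, this could be applied as df["text"] = df["text"].map(clean_text). It does not split out records whose heading uses == instead of ===; those stay embedded in the preceding row's text.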

print(df.loc[df['title'] == 'Algeria', 'text'].iloc[0])

paying researchers.
Ecological anthropology is defined as the ""study of [[cultural adaptation]]s to environments""
"TITULO: Agricultural science SUBTITULO Y PARRAFO: ==Fields or related disciplines==
{{Col-begin}}
{{Col-break}}
* [[Agricultural biotechnology]]
* [[Agricultural chemistry]]
* [[Agricultural diversification]]
* [[Agricultural education]]
* [[Agricultural economics]]
* [[Agricultural engineering]]"
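
The Algeria text above still contains the whole "Agricultural science" record, because that heading is wrapped in == rather than ===, so neither the pattern nor the split on "===" sees it. One possible alternative is to scan the lines once and start a new record at every header line. The sketch below assumes every record starts with a line of the form TITULO: <title> SUBTITULO Y PARRAFO: followed by a heading wrapped in two or more = signs on each side. `parse_records` and the `sample` string are hypothetical names used only for illustration.

```python
import re

# Header line: optional leading quote, the title, then a heading wrapped
# in the same run of two or more "=" signs on both sides.
HEADER = re.compile(r'^"?TITULO: (.*?) SUBTITULO Y PARRAFO: (={2,})(.*?)\2\s*$')


def parse_records(lines):
    """Hypothetical helper: group lines into title/head/text records."""
    records = []
    for line in lines:
        m = HEADER.match(line)
        if m:
            # A header line starts a new record.
            records.append({"title": m.group(1), "head": m.group(3), "text": []})
        elif records:
            # Any other line belongs to the most recent record.
            records[-1]["text"].append(line.strip())
    for rec in records:
        # Drop empty lines and lone quote characters, then join with spaces.
        rec["text"] = " ".join(p for p in rec["text"] if p and p != '"')
    return records


sample = '''"TITULO: Albedo SUBTITULO Y PARRAFO: ===Trees===
Because forests generally have a low albedo
"

"TITULO: Agricultural science SUBTITULO Y PARRAFO: ==Fields or related disciplines==
{{Col-begin}}
* [[Agricultural biotechnology]]'''

records = parse_records(sample.splitlines())
for rec in records:
    print(rec["title"], "|", rec["head"], "|", rec["text"])
```

Passing the resulting list of dicts to pd.DataFrame(records) would give the title, head and text columns. With a real file, the lines could come from iterating over the open file object instead of `sample.splitlines()`.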