I have a file that is not in CSV format, with the following content.
File:
"TITULO: Albedo SUBTITULO Y PARRAFO: ===Trees===
Because forests generally have a low albedo, (the majority of the ultraviolet and [[visible
`spectrum]] is absorbed through [[photosynthesis]])
"
"TITULO: Albedo SUBTITULO Y PARRAFO: ===Human activities===
Human activities (e.g., deforestation, farming, and urbanization) change the albedo of various areas
around
"TITULO: Abraham Lincoln SUBTITULO Y PARRAFO: ==U.S. House of Representatives, 1847–1849==
[[File:Abraham Lincoln by Nicholas Shepherd, 1846-crop.jpg|thumb|upright|alt=Middle
True to his record, Lincoln professed to friends in 1861 to be ""an old line Whig,
"TITULO: Abraham Lincoln SUBTITULO Y PARRAFO: ===Re-election===
{{Main|1864 United States presidential election}}
[[File:ElectoralCollege1864.svg|thumb|upright=1.3|alt=Map of the
"TITULO: Algeria SUBTITULO Y PARRAFO: ===Research and alternative energy sources===
Algeria has invested an estimated 100 billion dinars towards developing research facilities and
paying researchers.
Ecological anthropology is defined as the ""study of [[cultural adaptation]]s to environments""
"TITULO: Agricultural science SUBTITULO Y PARRAFO: ==Fields or related disciplines==
{{Col-begin}}
{{Col-break}}
* [[Agricultural biotechnology]]
* [[Agricultural chemistry]]
* [[Agricultural diversification]]
* [[Agricultural education]]
* [[Agricultural economics]]
* [[Agricultural engineering]]
I have this program:
import pandas as pd
data = pd.read_csv('datos_titulos.csv', header = None)
print(data)
I get this error:
ParserError: Error tokenizing data. C error: Expected 1 fields in line 6, saw 3
The DataFrame should look like this:
Title                 Head                           TXT
Albedo                Trees                          Because forests generally have a low ...([[photosynthesis]])
Albedo                Human activities               Human activities (e.g., de...areas around
Abraham Lincoln       U.S. House of..1849            [[File:Abraham Lincoln by... line Whig,
.
.
.
Agricultural science  Fields or related disciplines  {{Col-begin}} {{Col-break}}...* [[Agricultural engineering]]
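That is, the Title column holds the value after TITULO. Head is the text between the equals signs after SUBTITULO Y PARRAFO (==this text==), and TXT is everything that follows, up to the next TITULO.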
Answer 0 (score: 0)
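IIUC, you can use two regular expressions from the re module. First, we loop over the lines of your text file to get the title and head fields.
Second, we use re.split to collect the text field. This relies on the assumption that, although your data is in a messy text format, it keeps the order title > head > text.
You will need to do some further cleanup on the text column, but this was fun :)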
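To make the re.split step described above more concrete, here is a minimal sketch on a made-up two-record string (the `sample` text is an assumption, shortened from the question's data). It shows why every second element of the split result, starting at index 2, is a text body, and why those bodies still need cleanup.

```python
import re

# Made-up two-record sample in the same layout as the question's file.
sample = (
    '"TITULO: Albedo SUBTITULO Y PARRAFO: ===Trees===\n'
    'Forests have a low albedo.\n'
    '"TITULO: Albedo SUBTITULO Y PARRAFO: ===Human activities===\n'
    'Human activities change the albedo.\n'
)

parts = re.split(r"===", sample.strip())

# parts[0] is the first TITULO prefix, odd indices hold the heads,
# and even indices from 2 onward hold the text bodies.
heads = parts[1::2]
texts = [p.strip() for p in parts[2::2]]

print(heads)
# Every body except the last still ends with the next record's
# TITULO prefix, which is the "further cleanup" mentioned above.
print(texts)
```

Splitting on the literal === is simple, but it only works while every heading uses exactly three equals signs and the body itself never contains ===.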
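import re
from collections import defaultdict

import pandas as pd

pandas_dict = defaultdict(list)

# Header lines look like: TITULO: <title> SUBTITULO Y PARRAFO: ===<head>===
# Note: this pattern only matches heads wrapped in ===, not ==.
pat = r"TITULO: (.*) SUBTITULO Y PARRAFO: ===(.*?)==="

# First pass: collect the title and head fields line by line.
with open("file.txt", "r") as f:
    for line in f:
        match = re.search(pat, line)
        if match:
            pandas_dict["title"].append(match.group(1))
            pandas_dict["head"].append(match.group(2))

# Second pass: split the whole file on === so that heads and texts alternate.
with open("file.txt", "r") as f:
    body = f.read()

b = re.split(r"===", body.strip())
# b[0] is the first TITULO prefix, b[1] the first head, b[2] its text, and so on.
for line in b[2::2]:
    pandas_dict["text"].append(line.strip())

df = pd.DataFrame(pandas_dict)
print(df)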
   title            head                                     text
0  Albedo           Trees                                    Because forests generally have a low albedo, (...
1  Albedo           Human activities                         Human activities (e.g., deforestation, farming...
2  Abraham Lincoln  Re-election                              {{Main|1864 United States presidential electio...
3  Algeria          Research and alternative energy sources  Algeria has invested an estimated 100 billion ...
print(df[df['title'] == 'Algeria']['text'])
paying researchers.
Ecological anthropology is defined as the ""study of [[cultural adaptation]]s to environments""
"TITULO: Agricultural science SUBTITULO Y PARRAFO: ==Fields or related disciplines==
{{Col-begin}}
{{Col-break}}
* [[Agricultural biotechnology]]
* [[Agricultural chemistry]]
* [[Agricultural diversification]]
* [[Agricultural education]]
* [[Agricultural economics]]
* [[Agricultural engineering]]"
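As the Algeria row shows, sections whose heading uses == instead of === are not recognized, so their content ends up inside the previous row's text. One possible way around this, sketched here as an assumption rather than as part of the original answer, is a single multiline pattern that accepts two or more equals signs and slices the text between consecutive header matches. The `sample` string, the `header` pattern and the `rows` list are all hypothetical names introduced for illustration.

```python
import re

# Shortened, made-up sample that mixes == and === headings.
sample = '''"TITULO: Albedo SUBTITULO Y PARRAFO: ===Trees===
Forests have a low albedo.
"
"TITULO: Abraham Lincoln SUBTITULO Y PARRAFO: ==U.S. House of Representatives, 1847–1849==
True to his record, Lincoln professed to be ""an old line Whig"".
"TITULO: Agricultural science SUBTITULO Y PARRAFO: ==Fields or related disciplines==
* [[Agricultural biotechnology]]
* [[Agricultural chemistry]]"
'''

# A header line: TITULO, then a head wrapped in the same run of two or more '='.
header = re.compile(
    r'^"?TITULO: (?P<title>.*?) SUBTITULO Y PARRAFO: '
    r'(?P<eq>={2,})(?P<head>.*?)(?P=eq)[ \t]*$',
    re.MULTILINE,
)

matches = list(header.finditer(sample))
rows = []
for i, m in enumerate(matches):
    # The text runs from the end of this header to the start of the next one.
    end = matches[i + 1].start() if i + 1 < len(matches) else len(sample)
    # Dropping the record's wrapping quotes is a simplification: it would
    # also remove a legitimate quote at the very start or end of a text.
    text = sample[m.end():end].strip().strip('"').strip()
    rows.append({"title": m.group("title"), "head": m.group("head"), "text": text})

for row in rows:
    print(row["title"], "|", row["head"], "|", row["text"][:40])
```

Passing `rows` to `pd.DataFrame(rows)` would then give a frame with title, head and text columns, with one row per header regardless of how many equals signs the heading uses.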