I am trying to parse some wikitext. Here is a sample of the text I need to parse:
== title ==
=== subtopic ===
*text_1
**text dependent on text_1
**text_2 dependent on text_1
*text_2
**text dependent on text_2
=== other subtopic ===
*text_2
**text dependent on text_2
== other title ==
...
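The structure is not complicated:
title: I believe there is at least one title in the whole document
subtopic: optional
element: every title/subtopic must have at least one
sub-element: optional, and may repeat
If sub-elements repeat, I plan to join them with \ln.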
What I want is to parse this into a dictionary with the following structure:
{
"title": "title",
"subtopic": "subtopic",
"main_text": "text_1",
"sub_text": "text dependent on text_1 \ln text_2 dependent on text_1"
}
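Do you know of any Pythonic way, or have any ideas, to parse this into the structure I want? I really appreciate your time.
PS. This is the full page I am trying to parse and extract quotes from: Woody Allen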
Answer 0 (score: 0)
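You say "quotes", but you linked to Wikipedia. Did you mean Wikiquote?
In any case, you should not parse wikitext yourself. You can get what you want by calling the parse API from a Python client.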
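A minimal sketch of such a call, using only the Python standard library instead of a dedicated client. The helper names build_parse_url and fetch_sections are hypothetical, the User-Agent string is a placeholder, and format=json is an added assumption so that the API returns plain JSON:

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://en.wikiquote.org/w/api.php"


def build_parse_url(page, prop="sections"):
    """Build a URL for the MediaWiki action=parse endpoint (hypothetical helper)."""
    params = {"action": "parse", "page": page, "prop": prop, "format": "json"}
    return API_URL + "?" + urlencode(params)


def fetch_sections(page):
    """Fetch the section list of a page. This performs a network request."""
    request = Request(
        build_parse_url(page),
        # Placeholder User-Agent; replace it with something that identifies your script.
        headers={"User-Agent": "wikiquote-sections-example/0.1"},
    )
    with urlopen(request) as response:
        return json.load(response)["parse"]["sections"]
```

Calling fetch_sections("Woody_Allen") should then return the list stored under "parse" > "sections" in the response.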
For example, here is the list of sections (i.e. the works quoted) in his Wikiquote article, from https://en.wikiquote.org/w/api.php?action=parse&page=Woody_Allen&prop=sections:
{
"parse": {
"title": "Woody Allen",
"pageid": 80,
"sections": [
{
"toclevel": 1,
"level": "2",
"line": "Quotes",
"number": "1",
"index": "1",
"fromtitle": "Woody_Allen",
"byteoffset": 657,
"anchor": "Quotes"
},
{
"toclevel": 2,
"level": "3",
"line": "<i>Getting Even</i> (1971)",
"number": "1.1",
"index": "2",
"fromtitle": "Woody_Allen",
"byteoffset": 11322,
"anchor": "Getting_Even_.281971.29"
},
{
"toclevel": 3,
"level": "4",
"line": "<i>My Philosophy</i>",
"number": "1.1.1",
"index": "3",
"fromtitle": "Woody_Allen",
"byteoffset": 11471,
"anchor": "My_Philosophy"
},
{
"toclevel": 2,
"level": "3",
"line": "<i>Everything You Always Wanted to Know About Sex* (*But Were Afraid to Ask)</i> (1972)",
"number": "1.2",
"index": "4",
"fromtitle": "Woody_Allen",
"byteoffset": 11814,
"anchor": "Everything_You_Always_Wanted_to_Know_About_Sex.2A_.28.2ABut_Were_Afraid_to_Ask.29_.281972.29"
},
{
"toclevel": 2,
"level": "3",
"line": "<i>Sleeper</i> (1973)",
"number": "1.3",
"index": "5",
"fromtitle": "Woody_Allen",
"byteoffset": 12364,
"anchor": "Sleeper_.281973.29"
},
{
"toclevel": 2,
"level": "3",
"line": "<i>Love and Death</i> (1975)",
"number": "1.4",
"index": "6",
"fromtitle": "Woody_Allen",
"byteoffset": 12858,
"anchor": "Love_and_Death_.281975.29"
},
{
"toclevel": 2,
"level": "3",
"line": "<i>Without Feathers</i> (1975)",
"number": "1.5",
"index": "7",
"fromtitle": "Woody_Allen",
"byteoffset": 14090,
"anchor": "Without_Feathers_.281975.29"
},
{
"toclevel": 2,
"level": "3",
"line": "<i>Annie Hall</i> (1977)",
"number": "1.6",
"index": "8",
"fromtitle": "Woody_Allen",
"byteoffset": 16485,
"anchor": "Annie_Hall_.281977.29"
},
{
"toclevel": 2,
"level": "3",
"line": "<i>Side Effects</i> (1980)",
"number": "1.7",
"index": "9",
"fromtitle": "Woody_Allen",
"byteoffset": 16899,
"anchor": "Side_Effects_.281980.29"
},
{
"toclevel": 3,
"level": "4",
"line": "My Apology",
"number": "1.7.1",
"index": "10",
"fromtitle": "Woody_Allen",
"byteoffset": 17529,
"anchor": "My_Apology"
},
{
"toclevel": 2,
"level": "3",
"line": "<i>Manhattan Murder Mystery</i> (1993)",
"number": "1.8",
"index": "11",
"fromtitle": "Woody_Allen",
"byteoffset": 18579,
"anchor": "Manhattan_Murder_Mystery_.281993.29"
},
{
"toclevel": 2,
"level": "3",
"line": "<i>Don't Drink the Water</i> (1994)",
"number": "1.9",
"index": "12",
"fromtitle": "Woody_Allen",
"byteoffset": 18960,
"anchor": "Don.27t_Drink_the_Water_.281994.29"
},
{
"toclevel": 2,
"level": "3",
"line": "<i>Deconstructing Harry</i> (1997)",
"number": "1.10",
"index": "13",
"fromtitle": "Woody_Allen",
"byteoffset": 19228,
"anchor": "Deconstructing_Harry_.281997.29"
},
{
"toclevel": 2,
"level": "3",
"line": "<i>Standup Comic</i> (1999)",
"number": "1.11",
"index": "14",
"fromtitle": "Woody_Allen",
"byteoffset": 21289,
"anchor": "Standup_Comic_.281999.29"
},
{
"toclevel": 2,
"level": "3",
"line": "<i>Mere Anarchy</i> (2007)",
"number": "1.12",
"index": "15",
"fromtitle": "Woody_Allen",
"byteoffset": 22463,
"anchor": "Mere_Anarchy_.282007.29"
},
{
"toclevel": 1,
"level": "2",
"line": "Attributed",
"number": "2",
"index": "16",
"fromtitle": "Woody_Allen",
"byteoffset": 24181,
"anchor": "Attributed"
},
{
"toclevel": 1,
"level": "2",
"line": "Others",
"number": "3",
"index": "17",
"fromtitle": "Woody_Allen",
"byteoffset": 25045,
"anchor": "Others"
},
{
"toclevel": 1,
"level": "2",
"line": "Quotes about Allen",
"number": "4",
"index": "18",
"fromtitle": "Woody_Allen",
"byteoffset": 27525,
"anchor": "Quotes_about_Allen"
},
{
"toclevel": 1,
"level": "2",
"line": "External links",
"number": "5",
"index": "19",
"fromtitle": "Woody_Allen",
"byteoffset": 29106,
"anchor": "External_links"
}
]
}
}
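The flat section list can be folded back into the title/subtopic shape from the question by remembering the most recent heading at each "toclevel". A rough sketch, assuming the response shape shown above; section_paths is a hypothetical helper, and the regex is only a crude way to drop inline tags such as <i>:

```python
import re


def section_paths(sections):
    """For each section, return the heading titles from the top level down to it."""
    stack = []  # stack[i] holds the current heading at toclevel i + 1
    paths = []
    for section in sections:
        level = section["toclevel"]
        # Drop inline HTML such as <i>...</i> from the heading text.
        title = re.sub(r"<[^>]+>", "", section["line"])
        # Discard headings at this level or deeper, then push the new one.
        del stack[level - 1:]
        stack.append(title)
        paths.append(list(stack))
    return paths


# A few entries copied from the response above (only the fields used here).
sample = [
    {"toclevel": 1, "line": "Quotes"},
    {"toclevel": 2, "line": "<i>Getting Even</i> (1971)"},
    {"toclevel": 3, "line": "<i>My Philosophy</i>"},
    {"toclevel": 2, "line": "<i>Sleeper</i> (1973)"},
    {"toclevel": 1, "line": "Attributed"},
]

for path in section_paths(sample):
    print(" > ".join(path))
```

The first element of each path plays the role of "title" and the following ones the role of "subtopic". The text of each section can be requested separately by passing that section's "index" value as the section parameter of the same API.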