Using Python 3

Time: 2015-08-21 02:23:55

Tags: parsing python-3.x text wikitext

I am trying to parse some wikitext. Here is a sample of the text I need to parse:

== title ==
=== subtopic ===
*text_1
**text dependent on text_1
**text_2 dependent on text_1
*text_2
**text dependent on text_2
=== other subtopic ===
*text_2
**text dependent on text_2
== other title ==
...

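The structure here is not complicated:
title: I believe there is at least one title in the whole document
subtopic: optional
elements: every topic/subtopic must have at least one
sub-elements: optional, and may be repeated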

If sub-elements are repeated, I plan to join them with \ln.

What I want to do is parse it into dictionaries with the following structure:

{
"title": "title",
"subtopic": "subtopic",
"main_text": "text_1",
"sub_text": "text dependent on text_1 \ln text_2 dependent on text_1"
}

Do you know of any Pythonic way, or have any ideas, to parse this into the structure I want? I really appreciate your time.

P.S. This is the full file I am trying to parse and extract quotes from: Woody Allen

1 answer:

Answer 0 (score: 0)

You say "quotes", but you linked to Wikipedia. Do you mean Wikiquote?

In any case, you must not parse the wikitext yourself. You can achieve your goal with the parse API, accessed through a Python client.
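As a rough illustration of what such a request might look like, here is a minimal sketch that uses only the standard library instead of a dedicated client. The helper names `build_parse_url` and `fetch_sections` and the User-Agent string are hypothetical, and `format=json` is added on the assumption that a JSON response is wanted.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Wikiquote endpoint taken from the example URL in this answer.
API_ENDPOINT = "https://en.wikiquote.org/w/api.php"


def build_parse_url(page, prop="sections", endpoint=API_ENDPOINT):
    """Build an action=parse URL for one page (hypothetical helper)."""
    params = {
        "action": "parse",
        "page": page,
        "prop": prop,
        "format": "json",  # assumption: ask for a JSON response
    }
    return endpoint + "?" + urlencode(params)


def fetch_sections(page):
    """Fetch the section list of a page; requires network access."""
    request = Request(
        build_parse_url(page),
        # Placeholder User-Agent; replace with one identifying your script.
        headers={"User-Agent": "wikiquote-sections-example/0.1"},
    )
    with urlopen(request) as response:
        data = json.load(response)
    return data["parse"]["sections"]
```

Calling `fetch_sections("Woody_Allen")` should return the list found under `parse.sections` in the response, assuming the request succeeds.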

For example, the list of sections (i.e. the works quoted) in his Wikiquote article, from https://en.wikiquote.org/w/api.php?action=parse&page=Woody_Allen&prop=sections:

{
    "parse": {
        "title": "Woody Allen",
        "pageid": 80,
        "sections": [
            {
                "toclevel": 1,
                "level": "2",
                "line": "Quotes",
                "number": "1",
                "index": "1",
                "fromtitle": "Woody_Allen",
                "byteoffset": 657,
                "anchor": "Quotes"
            },
            {
                "toclevel": 2,
                "level": "3",
                "line": "<i>Getting Even</i> (1971)",
                "number": "1.1",
                "index": "2",
                "fromtitle": "Woody_Allen",
                "byteoffset": 11322,
                "anchor": "Getting_Even_.281971.29"
            },
            {
                "toclevel": 3,
                "level": "4",
                "line": "<i>My Philosophy</i>",
                "number": "1.1.1",
                "index": "3",
                "fromtitle": "Woody_Allen",
                "byteoffset": 11471,
                "anchor": "My_Philosophy"
            },
            {
                "toclevel": 2,
                "level": "3",
                "line": "<i>Everything You Always Wanted to Know About Sex* (*But Were Afraid to Ask)</i> (1972)",
                "number": "1.2",
                "index": "4",
                "fromtitle": "Woody_Allen",
                "byteoffset": 11814,
                "anchor": "Everything_You_Always_Wanted_to_Know_About_Sex.2A_.28.2ABut_Were_Afraid_to_Ask.29_.281972.29"
            },
            {
                "toclevel": 2,
                "level": "3",
                "line": "<i>Sleeper</i> (1973)",
                "number": "1.3",
                "index": "5",
                "fromtitle": "Woody_Allen",
                "byteoffset": 12364,
                "anchor": "Sleeper_.281973.29"
            },
            {
                "toclevel": 2,
                "level": "3",
                "line": "<i>Love and Death</i> (1975)",
                "number": "1.4",
                "index": "6",
                "fromtitle": "Woody_Allen",
                "byteoffset": 12858,
                "anchor": "Love_and_Death_.281975.29"
            },
            {
                "toclevel": 2,
                "level": "3",
                "line": "<i>Without Feathers</i> (1975)",
                "number": "1.5",
                "index": "7",
                "fromtitle": "Woody_Allen",
                "byteoffset": 14090,
                "anchor": "Without_Feathers_.281975.29"
            },
            {
                "toclevel": 2,
                "level": "3",
                "line": "<i>Annie Hall</i> (1977)",
                "number": "1.6",
                "index": "8",
                "fromtitle": "Woody_Allen",
                "byteoffset": 16485,
                "anchor": "Annie_Hall_.281977.29"
            },
            {
                "toclevel": 2,
                "level": "3",
                "line": "<i>Side Effects</i> (1980)",
                "number": "1.7",
                "index": "9",
                "fromtitle": "Woody_Allen",
                "byteoffset": 16899,
                "anchor": "Side_Effects_.281980.29"
            },
            {
                "toclevel": 3,
                "level": "4",
                "line": "My Apology",
                "number": "1.7.1",
                "index": "10",
                "fromtitle": "Woody_Allen",
                "byteoffset": 17529,
                "anchor": "My_Apology"
            },
            {
                "toclevel": 2,
                "level": "3",
                "line": "<i>Manhattan Murder Mystery</i> (1993)",
                "number": "1.8",
                "index": "11",
                "fromtitle": "Woody_Allen",
                "byteoffset": 18579,
                "anchor": "Manhattan_Murder_Mystery_.281993.29"
            },
            {
                "toclevel": 2,
                "level": "3",
                "line": "<i>Don't Drink the Water</i> (1994)",
                "number": "1.9",
                "index": "12",
                "fromtitle": "Woody_Allen",
                "byteoffset": 18960,
                "anchor": "Don.27t_Drink_the_Water_.281994.29"
            },
            {
                "toclevel": 2,
                "level": "3",
                "line": "<i>Deconstructing Harry</i> (1997)",
                "number": "1.10",
                "index": "13",
                "fromtitle": "Woody_Allen",
                "byteoffset": 19228,
                "anchor": "Deconstructing_Harry_.281997.29"
            },
            {
                "toclevel": 2,
                "level": "3",
                "line": "<i>Standup Comic</i> (1999)",
                "number": "1.11",
                "index": "14",
                "fromtitle": "Woody_Allen",
                "byteoffset": 21289,
                "anchor": "Standup_Comic_.281999.29"
            },
            {
                "toclevel": 2,
                "level": "3",
                "line": "<i>Mere Anarchy</i> (2007)",
                "number": "1.12",
                "index": "15",
                "fromtitle": "Woody_Allen",
                "byteoffset": 22463,
                "anchor": "Mere_Anarchy_.282007.29"
            },
            {
                "toclevel": 1,
                "level": "2",
                "line": "Attributed",
                "number": "2",
                "index": "16",
                "fromtitle": "Woody_Allen",
                "byteoffset": 24181,
                "anchor": "Attributed"
            },
            {
                "toclevel": 1,
                "level": "2",
                "line": "Others",
                "number": "3",
                "index": "17",
                "fromtitle": "Woody_Allen",
                "byteoffset": 25045,
                "anchor": "Others"
            },
            {
                "toclevel": 1,
                "level": "2",
                "line": "Quotes about Allen",
                "number": "4",
                "index": "18",
                "fromtitle": "Woody_Allen",
                "byteoffset": 27525,
                "anchor": "Quotes_about_Allen"
            },
            {
                "toclevel": 1,
                "level": "2",
                "line": "External links",
                "number": "5",
                "index": "19",
                "fromtitle": "Woody_Allen",
                "byteoffset": 29106,
                "anchor": "External_links"
            }
        ]
    }
}
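
The flat section list above can be turned into a title/subtopic hierarchy by following `toclevel`. This is only a sketch under the assumption that the list arrives in document order, as in the response above; `clean_title` and `nest_sections` are hypothetical helper names, and the tag stripping is a simple regular expression that handles markup like `<i>...</i>`, not arbitrary HTML.

```python
import re


def clean_title(line):
    """Drop simple HTML tags such as <i>...</i> from a section title."""
    return re.sub(r"<[^>]+>", "", line)


def nest_sections(sections):
    """Nest a flat, document-ordered section list by its toclevel values."""
    root = {"title": None, "children": []}
    # Stack of (toclevel, node) pairs; the root sits at level 0.
    stack = [(0, root)]
    for section in sections:
        level = section["toclevel"]
        node = {
            "title": clean_title(section["line"]),
            "index": section["index"],
            "children": [],
        }
        # Pop back up until the top of the stack is a shallower section.
        while stack[-1][0] >= level:
            stack.pop()
        stack[-1][1]["children"].append(node)
        stack.append((level, node))
    return root["children"]
```

Each node keeps the section's `index`, which could then be used to request the content of one section at a time (the parse API accepts a `section` parameter for this).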