从文本中解析代码 - Python

时间:2016-01-29 23:33:29

标签: python apache-spark

我正在分析StackOverflow的转储文件" Posts.Small.xml"使用pySpark。我想分开代码块'来自' text'连续。典型的解析行如下所示:

            ['[u"<p>I want to use a track-bar to change a form\'s opacity.</p>&#xA;&#xA;
        <p>This is my code:</p>&#xA;&#xA;<pre><code>decimal trans = trackBar1.Value / 5000;&#xA;this.Opacity = trans;&#xA;</code></pre>&#xA;&#xA;
    <p>When I try to build it, I get this error:</p>&#xA;&#xA;<blockquote>&#xA;  <p>Cannot implicitly convert type \'decimal\' to \'double\'.
</p>&#xA;</blockquote>&#xA;&#xA;<p>I tried making <code>trans</code> a <code>double</code>, but then the control doesn\'t work.',
             '", u\'This code has worked fine for me in VB.NET in the past.',
             '\', u"</p>&#xA; When setting a form\'s opacity should I use a decimal or double?"]']

我已经尝试过&#34; itertools&#34;和一些python函数,但无法得到结果。 我提取上述行的初始代码是:

postsXml = textFile.filter( lambda line: not line.startswith("<?xml version=")
postsRDD = postsXml.map(............)
tokensentRDD = postsRDD.map(lambda x:(x[0], nltk.sent_tokenize(x[3])))
new = tokensentRDD.map(lambda x: x[1]).take(1)
a = ''.join(map(str,new))
b = a.replace("&lt;", "<")
final = b.replace("&gt;", ">")
nltk.sent_tokenize(final)

感谢任何想法!

2 个答案:

答案 0 :(得分:2)

您可以使用XPath提取code内容(lxml库将有所帮助),然后提取文本内容,选择其他所有内容,例如:

import lxml.etree


data = '''<p>I want to use a track-bar to change a form's opacity.</p>
          <p>This is my code:</p> <pre><code>decimal trans = trackBar1.Value / 5000; this.Opacity = trans;</code></pre>
          <p>When I try to build it, I get this error:</p>
          <p>Cannot implicitly convert type 'decimal' to 'double'.</p>
          <p>I tried making <code>trans</code> a <code>double</code>.</p>'''

html = lxml.etree.HTML(data)
code_blocks = html.xpath('//code/text()')
text_blocks = html.xpath('//*[not(descendant-or-self::code)]/text()') 

答案 1 :(得分:0)

最简单的方法可能是将正则表达式应用于文本,匹配标记“' and '”。这将使您能够找到代码块。但是你不会说你之后会怎么做。所以...

from itertools import zip_longest

sample_paras = [
    """<p>I want to use a track-bar to change a form\'s opacity.</p>&#xA;&#xA;<p>This is my code:</p>&#xA;&#xA;<pre><code>decimal trans = trackBar1.Value / 5000;&#xA;this.Opacity = trans;&#xA;</code></pre>&#xA;&#xA;<p>When I try to build it, I get this error:</p>&#xA;&#xA;<blockquote>&#xA;  <p>Cannot implicitly convert type \'decimal\' to \'double\'. </p>&#xA;</blockquote>&#xA;&#xA;<p>I tried making <code>trans</code> a <code>double</code>, but then the control doesn\'t work.""",
    """This code has worked fine for me in VB.NET in the past.""",
    """</p>&#xA; When setting a form\'s opacity should I use a decimal or double?""",
]

single_block = " ".join(sample_paras)

import re
separate_code = re.split(r"</?code>", single_block)

text_blocks, code_blocks = zip(*zip_longest(*[iter(separate_code)] * 2))

print("Text:\n")
for t in text_blocks:
    print("--")
    print(t)

print("\n\nCode:\n")
for t in code_blocks:
    print("--")
    print(t)