我正在分析StackOverflow的转储文件" Posts.Small.xml"使用pySpark。我想分开代码块'来自' text'连续。典型的解析行如下所示:
['[u"<p>I want to use a track-bar to change a form\'s opacity.</p>


<p>This is my code:</p>

<pre><code>decimal trans = trackBar1.Value / 5000;
this.Opacity = trans;
</code></pre>


<p>When I try to build it, I get this error:</p>

<blockquote>
 <p>Cannot implicitly convert type \'decimal\' to \'double\'.
</p>
</blockquote>

<p>I tried making <code>trans</code> a <code>double</code>, but then the control doesn\'t work.',
'", u\'This code has worked fine for me in VB.NET in the past.',
'\', u"</p>
 When setting a form\'s opacity should I use a decimal or double?"]']
我已经尝试过&#34; itertools&#34;和一些python函数,但无法得到结果。 我提取上述行的初始代码是:
postsXml = textFile.filter( lambda line: not line.startswith("<?xml version=")
postsRDD = postsXml.map(............)
tokensentRDD = postsRDD.map(lambda x:(x[0], nltk.sent_tokenize(x[3])))
new = tokensentRDD.map(lambda x: x[1]).take(1)
a = ''.join(map(str,new))
b = a.replace("<", "<")
final = b.replace(">", ">")
nltk.sent_tokenize(final)
感谢任何想法!
答案 0 :(得分:2)
您可以使用XPath提取code
内容(lxml
库将有所帮助),然后提取文本内容,选择其他所有内容,例如:
import lxml.etree
data = '''<p>I want to use a track-bar to change a form's opacity.</p>
<p>This is my code:</p> <pre><code>decimal trans = trackBar1.Value / 5000; this.Opacity = trans;</code></pre>
<p>When I try to build it, I get this error:</p>
<p>Cannot implicitly convert type 'decimal' to 'double'.</p>
<p>I tried making <code>trans</code> a <code>double</code>.</p>'''
html = lxml.etree.HTML(data)
code_blocks = html.xpath('//code/text()')
text_blocks = html.xpath('//*[not(descendant-or-self::code)]/text()')
答案 1 :(得分:0)
最简单的方法可能是将正则表达式应用于文本,匹配标记“' and '
”。这将使您能够找到代码块。但是你不会说你之后会怎么做。所以...
from itertools import zip_longest
sample_paras = [
"""<p>I want to use a track-bar to change a form\'s opacity.</p>

<p>This is my code:</p>

<pre><code>decimal trans = trackBar1.Value / 5000;
this.Opacity = trans;
</code></pre>

<p>When I try to build it, I get this error:</p>

<blockquote>
 <p>Cannot implicitly convert type \'decimal\' to \'double\'. </p>
</blockquote>

<p>I tried making <code>trans</code> a <code>double</code>, but then the control doesn\'t work.""",
"""This code has worked fine for me in VB.NET in the past.""",
"""</p>
 When setting a form\'s opacity should I use a decimal or double?""",
]
single_block = " ".join(sample_paras)
import re
separate_code = re.split(r"</?code>", single_block)
text_blocks, code_blocks = zip(*zip_longest(*[iter(separate_code)] * 2))
print("Text:\n")
for t in text_blocks:
print("--")
print(t)
print("\n\nCode:\n")
for t in code_blocks:
print("--")
print(t)