XPath selector that can handle variable structure

Time: 2016-11-18 03:05:52

Tags: html xpath scrapy

I need to extract some text using an XPath selector. The text can appear in 3 different forms:

<td>
    TARGET_TEXT
</td>

<td>
    <p>
        TARGET_TEXT
    </p>
</td>

<td>
    <p>
        <strong>TARGET_TEXT</strong>
    </p>
</td>

Is there a single XPath expression/selector that handles all three cases? Or is it possible to add an OR statement to an XPath selector?

for tr in table_rows:
    # only handles case 1
    topic_name = tr.xpath('.//td[1]/text()').extract()[0]

3 answers:

Answer 0 (score: 1)

This XPath,

normalize-space(/td)

will return the same space-normalized string value, TARGET_TEXT, for all three cases.

For more information on string values in XPath, see Testing text() nodes vs string values in XPath
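
To illustrate why the string value is the same in all three cases, here is a minimal sketch that approximates normalize-space() using only Python's standard library (xml.etree.ElementTree) rather than Scrapy, so it runs without extra dependencies. The helper name normalized_string_value is hypothetical, and str.split() collapses a slightly wider set of whitespace characters than XPath does, so treat it as an approximation.

```python
import xml.etree.ElementTree as ET

# The three markup variants from the question.
CASES = [
    "<td>\n    TARGET_TEXT\n</td>",
    "<td>\n    <p>\n        TARGET_TEXT\n    </p>\n</td>",
    "<td>\n    <p>\n        <strong>TARGET_TEXT</strong>\n    </p>\n</td>",
]


def normalized_string_value(element):
    """Approximate XPath normalize-space() (hypothetical helper).

    The string value of an element is the concatenation of all of its
    descendant text nodes; whitespace runs are then collapsed and the
    ends trimmed.
    """
    string_value = "".join(element.itertext())
    return " ".join(string_value.split())


results = [normalized_string_value(ET.fromstring(markup)) for markup in CASES]
print(results)
```

Inside the question's loop, the Scrapy equivalent would presumably be something like tr.xpath('normalize-space(.//td[1])').extract_first(), using a path relative to the row instead of the absolute /td shown above. That adaptation is an assumption, not part of the original answer.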

Answer 1 (score: 0)

for tr in table_rows:
    all_three = tr.xpath('.//td//text()').extract()

Answer 2 (score: -1)

It looks like the following is sufficient:

for tr in table_rows:
    topic_name = tr.xpath('.//td[1]//text()').extract()
    # topic_name can be ['\r\n', 'TARGET_TEXT', '\r\n']
    # join the text nodes, then strip the surrounding whitespace
    topic_name = ''.join(topic_name).strip()
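
Because the extracted text nodes include the whitespace that surrounds the target in the markup, a small helper can join them and collapse that whitespace in one step. This is only a sketch: clean_text_nodes is a hypothetical name, and its input is assumed to be the list of strings returned by .extract().

```python
def clean_text_nodes(text_nodes):
    """Join extracted text nodes and collapse whitespace (hypothetical helper)."""
    joined = "".join(text_nodes)
    # split() with no arguments drops leading/trailing whitespace and
    # treats any run of whitespace as a single separator.
    return " ".join(joined.split())


# Example input taken from the comment in the answer above.
topic_name = clean_text_nodes(['\r\n', 'TARGET_TEXT', '\r\n'])
print(topic_name)
```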