I need to extract some text using an XPath selector. The text can appear in three different forms:
<td>
TARGET_TEXT
</td>
<td>
<p>
TARGET_TEXT
</p>
</td>
<td>
<p>
<strong>TARGET_TEXT</strong>
</p>
</td>
Is there a single XPath expression/selector that handles all three cases? Or is it possible to add an OR condition to an XPath selector?
for tr in table_rows:
    # only handles case 1
    topic_name = tr.xpath('.//td[1]/text()').extract()[0]
Answer 0 (score: 1)
This XPath,
normalize-space(/td)
will return the same whitespace-normalized string value,
TARGET_TEXT
for all three cases.
For more on string values in XPath, see "Testing text() nodes vs string values in XPath".
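To illustrate why the string value is the same in all three cases, here is a minimal standard-library sketch. It does not run XPath itself; it emulates what normalize-space() computes by concatenating every descendant text node of the td and collapsing whitespace. The helper name normalized_string_value and the CASES list are made up for this illustration.

```python
import xml.etree.ElementTree as ET

# The three markup variants from the question.
CASES = [
    "<td>\nTARGET_TEXT\n</td>",
    "<td>\n<p>\nTARGET_TEXT\n</p>\n</td>",
    "<td>\n<p>\n<strong>TARGET_TEXT</strong>\n</p>\n</td>",
]


def normalized_string_value(element):
    """Emulate XPath normalize-space(string(element)) (hypothetical helper).

    The XPath string value of an element is the concatenation of all of its
    descendant text nodes; normalize-space() then trims the ends and collapses
    internal whitespace runs to single spaces. str.split() treats a few more
    characters as whitespace than XPath does, which makes no difference here.
    """
    string_value = "".join(element.itertext())
    return " ".join(string_value.split())


results = [normalized_string_value(ET.fromstring(markup)) for markup in CASES]
print(results)
```

In the question's Scrapy loop, the equivalent would presumably be something like tr.xpath('normalize-space(.//td[1])').extract()[0], using a path relative to the row rather than the absolute /td; that adaptation is an assumption, not part of the answer above.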
Answer 1 (score: 0)
for tr in table_rows:
    all_three = tr.xpath('.//td//text()').extract()
Answer 2 (score: -1)
It looks like the following is enough:
for tr in table_rows:
    topic_name = tr.xpath('.//td[1]//text()').extract()
    # topic_name can be ['\r\n', 'TARGET_TEXT', '\r\n']
    topic_name = ''.join(topic_name)
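Note that joining the extracted text nodes keeps the surrounding line breaks. A small sketch of trimming them, assuming the extracted list looks like the one in the comment (the name topic_name_parts is only for illustration):

```python
# Example list of text nodes, taken from the comment in the snippet above.
topic_name_parts = ['\r\n', 'TARGET_TEXT', '\r\n']

# Join the pieces, then strip leading and trailing whitespace.
topic_name = ''.join(topic_name_parts).strip()
print(repr(topic_name))
```

Unlike normalize-space(), strip() only trims the ends of the string; whitespace runs inside the text are left as they are.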