Suppose I have an array of strings:
['Laptop Apple Macbook Air A1465, Core i7, 8Gb, 256Gb SSD, 15"Retina, MacOS' ... 'another device description']
I want to extract the following features from each description:
item=Laptop
brand=Apple
model=Macbook Air A1465
cpu=Core i7
...
Should I first prepare predefined lists of known feature values? Something like:
brands = ['apple', 'dell', 'hp', 'asus', 'acer', 'lenovo']
cpu = ['core i3', 'core i5', 'core i7', 'intel pdc', 'core m', 'intel pentium', 'intel core duo']
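I am not sure whether I need CountVectorizer and TfidfVectorizer here. DictVectorizer seems more appropriate, but how do I extract the values for each key from the whole string in order to build the dict entries?
Can this be done with scikit-learn's feature extraction? Or should I write my own .fit() and .transform() methods?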
Update: @sergzach, please check whether I understood you correctly:
data = ['Laptop Apple Macbook..', 'Laptop Dell Latitude...']  # ... and so on
for d in data:
    for brand in brands:
        if brand in d.lower():  # the predefined lists are lowercase
            pass  # ok, brand is found
    for model in models:
        if model in d.lower():
            pass  # ok, model is found
So I write N loops, one per feature? That may work, but I am not sure it is correct or flexible.
Answer 0 (score: 0)
Yes, something like the following.
Sorry, you may need to correct the code below.
import re

data = ['Laptop Apple Macbook..', 'Laptop Dell Latitude...']  # ... and so on
features = {
    'brand': [r'apple', r'dell', r'hp', r'asus', r'acer', r'lenovo'],
    'cpu': [r'core\s+i3', r'core\s+i5', r'core\s+i7', r'intel\s+pdc', r'core\s+m', r'intel\s+pentium', r'intel\s+core\s+duo'],
    # and other features
}
cat_data = []  # one dict per line: column name -> numeric category
not_found_columns = []  # (column, line) pairs where no pattern matched
for line in data:
    line_cats = {}
    # items() replaces the Python 2 iteritems(); the loop variable is named
    # patterns so that it does not overwrite the features dict.
    for col, patterns in features.items():
        found = False
        for i, pattern in enumerate(patterns):
            if re.search(pattern, line.lower(), flags=re.UNICODE):
                # Numeric category for this column, e.g. 2 for dell, 5 for acer.
                line_cats[col] = i + 1
                found = True
                break  # the category is determined by the first matching pattern
        # The loop ended without a match: use 0 as the "feature not found" default.
        if not found:
            line_cats[col] = 0
            not_found_columns.append((col, line))
    cat_data.append(line_cats)
# cat_data now holds, for every line, a category (index + 1) per column,
# or 0 where no feature was detected.
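not_found_columns now holds the column names together with the lines in which nothing was found for that column. Look through them; you may have forgotten some feature values.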
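Grouping the misses by column can make that review easier. The following is only a minimal sketch: it assumes not_found_columns holds (column, line) tuples as in the code above, and the sample entries (including the 'Sony Vaio' line) are hypothetical.

```python
from collections import defaultdict

# Hypothetical sample in the same (column, line) shape as not_found_columns.
not_found_columns = [
    ('cpu', 'Laptop Dell Latitude...'),
    ('brand', 'Laptop Sony Vaio...'),
    ('cpu', 'Laptop Sony Vaio...'),
]

# Collect, for every column, the lines in which no pattern matched.
misses = defaultdict(list)
for col, line in not_found_columns:
    misses[col].append(line)

for col, lines in sorted(misses.items()):
    print(col, len(lines), lines)
```

A column with many misses usually points to values that are still absent from its pattern list.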
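We could also store strings (instead of numbers) as the categories and then use DV (DictVectorizer). The two approaches give equivalent results.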
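A minimal sketch of the string-category variant, assuming scikit-learn is installed. The cat_data rows here are hypothetical hand-written stand-ins for what the matching loop would produce, and 'none' is an assumed placeholder for "no pattern matched".

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical output of the matching loop, with the matched string
# stored per column instead of a numeric index.
cat_data = [
    {'brand': 'apple', 'cpu': 'core i7'},
    {'brand': 'dell', 'cpu': 'core i5'},
    {'brand': 'apple', 'cpu': 'none'},  # 'none' marks "no pattern matched"
]

# String values are one-hot encoded into columns named "key=value".
vectorizer = DictVectorizer(sparse=False)
X = vectorizer.fit_transform(cat_data)

print(vectorizer.feature_names_)
print(X)
```

Each distinct key/value pair becomes its own 0/1 column, so no manual numbering of the categories is needed.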
Answer 1 (score: 0)
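Scikit-learn's vectorizers convert an array of strings into a document-term matrix (a 2D array with one column for each term/word found). Each entry of the original array (the first dimension) maps to one row of the output matrix. Each cell holds a count or a weight, depending on which vectorizer you use and on its parameters.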
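To illustrate, here is a minimal sketch with CountVectorizer and its default settings, assuming scikit-learn is installed. The two descriptions are shortened, hypothetical examples.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two shortened, hypothetical device descriptions.
docs = [
    'Laptop Apple Macbook Air, Core i7',
    'Laptop Dell Latitude, Core i5',
]

vectorizer = CountVectorizer()  # defaults: lowercase, tokens of 2+ characters
X = vectorizer.fit_transform(docs)  # sparse matrix: one row per description

# vocabulary_ maps each term to its column index.
for term, column in sorted(vectorizer.vocabulary_.items(), key=lambda kv: kv[1]):
    print(column, term)
print(X.toarray())
```

The result only tells you which words occur in each description. It does not say that "apple" is a brand or that "core i7" is a CPU.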
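Judging by your code, I am not sure this is what you need. Can you say where you intend to use the features you want to extract? Do you plan to train a classifier? For what purpose?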