Suppose I have an Amazon product URL like this:
http://www.amazon.com/Kindle-Wireless-Reading-Display-Generation/dp/B0015T963C/ref=amb_link_86123711_2?pf_rd_m=ATVPDKIKX0DER&pf_rd_s=center-1&pf_rd_r=0AY9N5GXRYHCADJP5P0V&pf_rd_t=101&pf_rd_p=500528151&pf_rd_i=507846
How can I extract the ASIN using JavaScript? Thanks!
Answer 0 (score: 19)
Since the ASIN is always a sequence of 10 letters and/or digits immediately after a slash, try this:
url.match("/([a-zA-Z0-9]{10})(?:[/?]|$)")
The extra (?:[/?]|$) after the ASIN ensures that only a complete path segment is matched.
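As a minimal sketch, the pattern can be wrapped in a small function that returns the captured group or null. The name extractAsin is hypothetical, and the sample query string is shortened from the URL in the question.

```javascript
// Hypothetical helper (not from the original answer): returns the first
// 10-character alphanumeric path segment, or null if there is none.
function extractAsin(url) {
  // "/" + ten alphanumerics, then "/", "?" or end of string.
  const match = url.match(/\/([a-zA-Z0-9]{10})(?:[\/?]|$)/);
  return match ? match[1] : null;
}

const sample =
  "http://www.amazon.com/Kindle-Wireless-Reading-Display-Generation/dp/B0015T963C/ref=amb_link_86123711_2?pf_rd_m=ATVPDKIKX0DER";

console.log(extractAsin(sample));                                // "B0015T963C"
console.log(extractAsin("http://www.amazon.com/dp/B0015T963C")); // "B0015T963C"
console.log(extractAsin("http://www.amazon.com/"));              // null
```

Returning null instead of indexing the match directly avoids a TypeError when the URL contains no such segment.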
Answer 1 (score: 19)
Amazon detail pages can take several forms, so make sure you handle all of them. These are all equivalent:
http://www.amazon.com/Kindle-Wireless-Reading-Display-Generation/dp/B0015T963C
http://www.amazon.com/dp/B0015T963C
http://www.amazon.com/gp/product/B0015T963C
http://www.amazon.com/gp/product/glance/B0015T963C
They always look like one of these:
http://www.amazon.com/<SEO STRING>/dp/<VIEW>/ASIN
http://www.amazon.com/gp/product/<VIEW>/ASIN
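This should do it:
var url = "http://www.amazon.com/Kindle-Wireless-Reading-Display-Generation/dp/B0015T963C";
// Optional SEO string, then dp or gp/product, an optional view segment, then the 10-character ASIN.
// The dots are escaped so they match literally, and https is accepted as well as http.
var regex = RegExp("https?://www\\.amazon\\.com/([\\w-]+/)?(dp|gp/product)/(\\w+/)?(\\w{10})");
var m = url.match(regex);
if (m) {
    alert("ASIN=" + m[4]);
}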
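The same path shapes can also be checked after letting the built-in URL class strip the scheme, host and query string. This is only a sketch under assumptions that are not in the answer: the helper name asinFromPath is hypothetical, it accepts any host, and it assumes the ASIN is uppercase.

```javascript
// Hypothetical helper: matches the dp and gp/product path shapes listed above.
function asinFromPath(url) {
  // pathname excludes the scheme, host and query string.
  // Note: new URL() throws on input that is not an absolute URL.
  const path = new URL(url).pathname;
  // dp or gp/product, an optional view segment, then a 10-character ASIN
  // that ends the segment.
  const m = path.match(/\/(?:dp|gp\/product)\/(?:\w+\/)?([A-Z0-9]{10})(?:\/|$)/);
  return m ? m[1] : null;
}

const forms = [
  "http://www.amazon.com/Kindle-Wireless-Reading-Display-Generation/dp/B0015T963C",
  "http://www.amazon.com/dp/B0015T963C",
  "http://www.amazon.com/gp/product/B0015T963C",
  "http://www.amazon.com/gp/product/glance/B0015T963C",
];
for (const form of forms) {
  console.log(asinFromPath(form)); // "B0015T963C" for each form
}
```

Working on the pathname means query parameters can never be mistaken for the ASIN.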
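Answer 2 (score: 8)
Actually, the top answer fails for a URL like amazon.com/BlackBerry (because BlackBerry is also 10 characters long).
One workaround (assuming the ASIN is always uppercase, which it is when it comes from Amazon) is (in Ruby):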
url.match("/([A-Z0-9]{10})")
I have found that it works on thousands of URLs.
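The difference can be illustrated in JavaScript with the two patterns side by side. The variable names are hypothetical; the patterns are the ones from the top answer and from this answer.

```javascript
// Pattern from the top answer: accepts lowercase letters too.
const mixedCase = /\/([a-zA-Z0-9]{10})(?:[\/?]|$)/;
// Pattern from this answer: uppercase letters and digits only.
const upperOnly = /\/([A-Z0-9]{10})/;

const brandPage = "http://www.amazon.com/BlackBerry";
const productPage = "http://www.amazon.com/dp/B0015T963C";

console.log(mixedCase.test(brandPage));       // true  (false positive)
console.log(upperOnly.test(brandPage));       // false
console.log(productPage.match(upperOnly)[1]); // "B0015T963C"
```

One caveat: without a trailing check, the uppercase-only pattern would also accept the first 10 characters of a longer uppercase segment.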
Answer 3 (score: 4)
None of the above works in every case. I tried the following URLs, which include the examples above:
http://www.amazon.com/Kindle-Wireless-Reading-Display-Generation/dp/B0015T963C
http://www.amazon.com/dp/B0015T963C
http://www.amazon.com/gp/product/B0015T963C
http://www.amazon.com/gp/product/glance/B0015T963C
https://www.amazon.de/gp/product/B00LGAQ7NW/ref=s9u_simh_gw_i1?ie=UTF8&pd_rd_i=B00LGAQ7NW&pd_rd_r=5GP2JGPPBAXXP8935Q61&pd_rd_w=gzhaa&pd_rd_wg=HBg7f&pf_rd_m=A3JWKAKR8XB7XF&pf_rd_s=&pf_rd_r=GA7GB6X6K6WMJC6WQ9RB&pf_rd_t=36701&pf_rd_p=c210947d-c955-4398-98aa-d1dc27e614f1&pf_rd_i=desktop
https://www.amazon.de/Sawyer-Wasserfilter-Wasseraufbereitung-Outdoor-Filter/dp/B00FA2RLX2/ref=pd_sim_200_3?_encoding=UTF8&psc=1&refRID=NMR7SMXJAKC4B3MH0HTN
https://www.amazon.de/Notverpflegung-Kg-Marine-wasserdicht-verpackt/dp/B01DFJTYSQ/ref=pd_sim_200_5?_encoding=UTF8&psc=1&refRID=7QM8MPC16XYBAZMJNMA4
https://www.amazon.de/dp/B01N32MQOA?psc=1
This is the best I could come up with: (?:[/dp/]|$)([A-Z0-9]{10})
The full match also includes the leading / in every case. It can be removed afterwards.
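A short sketch of how this pattern behaves in JavaScript. Note that [/dp/] is a character class (one of "/", "d" or "p"), not the literal text "/dp/". The stricter variant at the end is an assumption, not part of the answer: it combines the uppercase-only class with the end-of-segment check from the top answer.

```javascript
// The pattern as posted (slashes escaped for a regex literal).
const posted = /(?:[\/dp\/]|$)([A-Z0-9]{10})/;
const m = "https://www.amazon.de/dp/B01N32MQOA?psc=1".match(posted);
console.log(m[0]); // "/B01N32MQOA" (full match, with the leading slash)
console.log(m[1]); // "B01N32MQOA"  (capture group, no slash to strip)
console.log(posted.test("brandABCDEFGHIJ")); // true, because "d" is in the class

// Hypothetical stricter variant: a whole uppercase path segment.
const strict = /\/([A-Z0-9]{10})(?:[\/?]|$)/;
const urls = [
  "https://www.amazon.de/gp/product/B00LGAQ7NW/ref=s9u_simh_gw_i1?ie=UTF8",
  "https://www.amazon.de/dp/B01N32MQOA?psc=1",
];
console.log(urls.map((u) => u.match(strict)[1])); // [ "B00LGAQ7NW", "B01N32MQOA" ]
```

Reading the capture group (index 1) rather than the full match (index 0) makes the later cleanup step unnecessary.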
Answer 4 (score: 1)
@Gumbo: your code is great!
// JS test: try it in the Firebug console.
url = window.location.href;
url.match("/([a-zA-Z0-9]{10})(?:[/?]|$)");
I've added a PHP function that does the same thing.
function amazon_get_asin_code($url) {
    global $debug;
    $result = "";
    // preg_match() needs pattern delimiters; "~" avoids having to escape the slashes.
    $pattern = '~/([a-zA-Z0-9]{10})(?:[/?]|$)~';
    preg_match($pattern, $url, $matches);
    if ($debug) {
        var_dump($matches);
    }
    if ($matches && isset($matches[1])) {
        $result = $matches[1];
    }
    return $result;
}
Answer 5 (score: 1)
Here is my universal Amazon ASIN regexp:
~(?:\b)((?=[0-9a-z]*\d)[0-9a-z]{10})(?:\b)~i
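For readers unfamiliar with lookaheads, here is a sketch of the same pattern as a JavaScript literal. The lookahead (?=[0-9a-z]*\d) requires at least one digit in the token, which is what rejects ordinary 10-letter words such as "Generation". The example.com URL is made up for illustration.

```javascript
// Same pattern, with the redundant non-capturing groups around \b dropped.
const universal = /\b((?=[0-9a-z]*\d)[0-9a-z]{10})\b/i;

const url =
  "http://www.amazon.com/Kindle-Wireless-Reading-Display-Generation/dp/B0015T963C";
console.log(url.match(universal)[1]);      // "B0015T963C"
console.log(universal.test("Generation")); // false (10 letters, no digit)

// Caveat: any 10-character token containing a digit matches, anywhere in the string.
const other = "https://example.com/?session=abc123def4"; // hypothetical URL
console.log(other.match(universal)[1]);    // "abc123def4"
```

Because the pattern is not tied to a path position, it is best applied to strings already known to be Amazon product URLs.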
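Answer 6 (score: 1)
This may be a simplistic approach, but I have not yet seen it fail on any of the URLs in this thread that people reported as problematic.
Simply put, I take the URL and split it on "/" to get the individual parts. I then loop over the array and test each part against the regex. In my case, the variable i is an object with a property called RawURL, which holds the raw URL I am working with, and a property called VendorSKU, which I am populating.
try
{
    string[] urlParts = i.RawURL.Split('/');
    Regex regex = new Regex(@"^[A-Z0-9]{10}");
    foreach (string part in urlParts)
    {
        Match m = regex.Match(part);
        if (m.Success)
        {
            i.VendorSKU = m.Value;
        }
    }
}
catch (Exception) { }
So far this has worked flawlessly.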
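Since the question asks for JavaScript, the same split-and-test idea might look like the sketch below. The function name asinFromSegments is hypothetical; like the C# loop, it keeps the last matching segment.

```javascript
// Hypothetical JavaScript port of the split-and-test approach.
function asinFromSegments(url) {
  let found = null;
  for (const part of url.split("/")) {
    // Anchored at the start only, so "B01N32MQOA?psc=1" still matches.
    const m = part.match(/^[A-Z0-9]{10}/);
    if (m) {
      found = m[0]; // no break: the last matching segment wins, as in the C# code
    }
  }
  return found;
}

console.log(asinFromSegments("https://www.amazon.de/dp/B01N32MQOA?psc=1"));
// "B01N32MQOA"
console.log(asinFromSegments(
  "http://www.amazon.com/Kindle-Wireless-Reading-Display-Generation/dp/B0015T963C/ref=amb_link_86123711_2"
));
// "B0015T963C"
```

Because the pattern has no end anchor, a segment that begins with more than 10 uppercase characters would yield only its first 10.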
Answer 7 (score: 1)
Inspired by many of the answers here, I found a pattern that
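works very well for extracting the ASIN from anywhere in the URL. You can try it here: https://regexr.com/56jm7
Edit: added end-of-string as one of the stop checks. When using the regex in Python, you need to use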
Answer 8 (score: 1)
Answer 9 (score: 0)
Something like this should work (untested):
var match = /\/dp\/(.*?)\/ref=amb_link/.exec(amazon_url);
var asin = match ? match[1] : '';
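Answer 10 (score: 0)
The Wikipedia article on ASIN (which I linked in your question) lists the various forms Amazon URLs can take. You can easily build a regular expression (or a series of them) and use the match() method to get this data.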
Answer 11 (score: 0)
With a small change to the regex from the first answer, this works on all the URLs I have tested.
var url = "http://www.amazon.com/Kindle-Wireless-Reading-Display-Generation/dp/B0015T963C";
var m = url.match("/([a-zA-Z0-9]{10})(?:[/?]|$)");
print(m);
if (m) {
    print("ASIN=" + m[1]);
}
Answer 12 (score: 0)
Answer 13 (score: 0)
This works very well for me. I tried all the links on this page as well as a few others:
function ExtractASIN(url){
    var ASINreg = new RegExp(/(?:\/)([A-Z0-9]{10})(?:$|\/|\?)/);
    var cMatch = url.match(ASINreg);
    if(cMatch == null){
        return null;
    }
    return cMatch[1];
}
ExtractASIN('http://www.amazon.com/Kindle-Wireless-Reading-Display-Generation/dp/B0015T963C/ref=amb_link_86123711_2?pf_rd_m=ATVPDKIKX0DER&pf_rd_s=center-1&pf_rd_r=0AY9N5GXRYHCADJP5P0V&pf_rd_t=101&pf_rd_p=500528151&pf_rd_i=507846');
Answer 14 (score: 0)
You can use XPath to scrape ASIN codes from the data-asin attribute in search results.
For example, $x('//@data-asin').map(function(v,i){return v.nodeValue}) can be run in Chrome's console.
Answer 15 (score: -2)
If the ASIN is always at that position in the URL:
var asin= decodeURIComponent(url.split('/')[5]);
The chance of the ASIN being %-escaped is slim, though.
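A fixed index only works for URLs with exactly one SEO segment before dp. As a sketch under that observation, the segment following "dp" can be looked up instead. The function name asinAfterDp is hypothetical, and the gp/product forms mentioned in other answers are deliberately not handled.

```javascript
// Hypothetical variant: take the segment right after "dp" instead of index 5.
function asinAfterDp(url) {
  const parts = new URL(url).pathname.split("/"); // query string already removed
  const i = parts.indexOf("dp");
  return i !== -1 && parts[i + 1] ? decodeURIComponent(parts[i + 1]) : null;
}

console.log(asinAfterDp(
  "http://www.amazon.com/Kindle-Wireless-Reading-Display-Generation/dp/B0015T963C/ref=amb_link_86123711_2"
)); // "B0015T963C"
console.log(asinAfterDp("http://www.amazon.com/dp/B0015T963C"));         // "B0015T963C"
console.log(asinAfterDp("http://www.amazon.com/gp/product/B0015T963C")); // null
```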