I have a list of URLs like this:
mydata <- read.table(header=TRUE, text="
Id
https://www.example.com/dp/c/830216013?q=%3Arelevance%3Abrickpattern%3ADecorative%2FArt+Deco%3Abrickpattern%3AFloral%3Abrickpattern%3AGeometric%3Abrickpattern%3AGraphic%3Abrickpattern%3ATropical%3Aprice%3A300%2C10500&page=7&gridValue=4
https://www.example.com/dp/c/830216013?q=%3Arelevance%3Averticalsizegroupformat%3AIN%2040%3Averticalcolorfamily%3ABlack%3Averticalcolorfamily%3ABlue%3Averticalcolorfamily%3AWhite
https://www.example.com/dp/c/830316016?q=%3Arelevance%3Averticalcolorfamily%3AWhite&gclid=CjwKEAjw9_jJBRCXycSarr3csWcSJABthk07W_H0RxQtOPZX7VdD9CSmK4S01BMYdXbtc0XxC0OeChoCky_w_wcB
https://www.example.com/dp/c/830216013?q=%3Arelevance%3Abrand%3AFLYING%20MACHINE%3Abrand%3AMUFTI%3Abrand%3AUNITED%20COLORS%20OF%20BENETTON
https://www.example.com/dp/c/830216013?q=%3Arelevance%3Averticalsizegroupformat%3AIN%2038%3Averticalsizegroupformat%3AIN%2039%3Averticalsizegroupformat%3AIN%20M%3Averticalsizegroupformat%3AUK%2039%3Averticalsizegroupformat%3AUK%20M%3Averticalsizegroupformat%3AUK%20S%3Averticalsizegroupformat%3AUS%20M%3Averticalsizegroupformat%3AUS%20S%3Abrickpattern%3ASolid%3Averticalcolorfamily%3ABlack%3Averticalcolorfamily%3AWhite
https://www.example.com/dp/c/830216013?q=%3Aprce-asc%3Abricksleeve%3AShort%3Aprice%3A300%2C10500&page=2&gridValue=4
https://www.example.com/dp/c/830216013??q=%3Aprce-asc%3Abrand%3AUS+POLO%3Abricksleeve%3AShort%3Aprice%3A300%2C10500
https://www.example.com/dp/c/830216013?q=%3Arelevance%3Abrand%3AAJIO%3Abrand%3ABASICS%3Abrand%3ACelio%3Abrand%3ADNMX%3Abrand%3AGAS%3Abrand%3ALEVIS%3Abrand%3ANETPLAY%3Abrand%3ASIN%3Abrand%3ASUPERDRY%3Abrand%3AUS%20POLO%3Abrand%3AVIMAL%3Abrand%3AVIMAL%20APPARELS%3Abrand%3AVOI%20JEANS
https://www.example.com/dp/c/830216013?q=%3Arelevance%3Abrand%3ABritish+Club%3Abrand%3ACelio%3Abrand%3AFLYING+MACHINE%3Aprice%3A300%2C10500&page=1&gridValue=4
")
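I need to extract the values of parameters such as brand, verticalcolorfamily, and q= from the URLs. These parameters are the filters applied on the website. The output I'm looking for is a data frame with three columns: parameter, value, and the frequency of the value. For example: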
parameter | value | frequency
----------|----------------|----------
brand | FLYING+MACHINE | 2
q= | relevance | 5
price | 300%2C10500 | 2
brand | BASICS | 1
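The only approach I can think of so far is to turn each URL into a character vector whose elements are separated at every other "%3A": [q=%3Arelevance, brickpattern%3ADecorative%2FArt+Deco, brickpattern%3AFloral, brickpattern%3AGeometric, brickpattern%3AGraphic, brickpattern%3ATropical, price%3A300%2C10500].
Then I would put each element in a column of a data frame, split again by '%3A', and do a group-by. Suggestions for other approaches would be much appreciated. Also, if I should go with this approach, I don't know how to use every other '%3A' as a separator.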
Answer 0 (score: 1)
urltools looks like a great package that could meet your needs. In the meantime, here's a hacked-together answer. Starting from your data.frame:
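# Convert to a character vector
L <- as.character(mydata$Id)
# Get rid of the URL prefix (this pattern only matches the 830216013 path;
# any other prefix stays in the first element for that URL)
L <- gsub("https://www.example.com/dp/c/830216013\\?", "", L)
# Split by "%3A" and flatten into one "long" vector
L <- unlist(strsplit(L, "%3A"))
head(L)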
[1] "q=" "relevance" "brickpattern"
[4] "Decorative%2FArt+Deco" "brickpattern" "Floral"
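One caveat with the gsub() call above: it hard-codes a single category path, so a URL with a different path (830316016) or a doubled "??" keeps part of its prefix in the first element. A minimal sketch of a more general strip, assuming every URL contains at least one "?" before the query string; the names urls, query and parts are illustrative only:

```r
# Two URLs in the style of the question: a different path, and a doubled "??"
urls <- c(
  "https://www.example.com/dp/c/830316016?q=%3Arelevance%3Averticalcolorfamily%3AWhite",
  "https://www.example.com/dp/c/830216013??q=%3Aprce-asc%3Abrand%3AUS+POLO"
)

# Drop everything up to and including the first run of "?" characters
query <- sub("^[^?]*\\?+", "", urls)

# Each query now starts with "q=", so the alternating parameter/value layout holds
parts <- strsplit(query, "%3A", fixed = TRUE)
print(parts[[1]])
print(parts[[2]])
```

With this variant every URL contributes "q=" as its first element, regardless of the path in front of the "?".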
Then:
# Convert to 2-column data frame
# Count unique parameter:value pairs
library(dplyr)  # provides %>%, group_by(), summarize() and filter()
df <- data.frame(parameter = L[seq(1, length(L), 2)], value = L[seq(2, length(L), 2)]) %>%
  group_by(parameter, value) %>%
  summarize(frequency = sum(!is.na(value)))
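The odd/even indexing with seq() relies on the vector strictly alternating parameter, value, parameter, value. A small sketch of the same pairing on a toy vector (the tokens below are made up for illustration), with a length check added as a safeguard:

```r
# Toy token vector in the same alternating layout as L
tokens <- c("q=", "relevance", "brand", "MUFTI", "brand", "Celio")

# The pairing only works if every parameter has a matching value
stopifnot(length(tokens) %% 2 == 0)

# Fill a 2-column matrix row by row: column 1 = parameter, column 2 = value
pairs <- matrix(tokens, ncol = 2, byrow = TRUE,
                dimnames = list(NULL, c("parameter", "value")))
print(pairs)
```

If a single URL produced an odd number of elements, every pair after it would be shifted by one, so a check like this is cheap insurance.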
I'll only show the entries with frequency >= 2:
# Show only entries with frequency >= 2
filter(df, frequency >= 2)
parameter value frequency
<fctr> <fctr> <int>
1 brand Celio 2
2 bricksleeve Short 2
3 q= relevance 6
4 verticalcolorfamily Black 2
5 verticalcolorfamily White 2
Note that brand::FLYING+MACHINE != 2, because FLYING MACHINE appears in two encodings: once as FLYING%20MACHINE and once as FLYING+MACHINE.
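That mismatch could be avoided by decoding the values before counting. A sketch in base R, assuming that "+" stands for a space in these query strings and that everything after the first "&" (page, gridValue, gclid) can be ignored; count_filters is a hypothetical helper written for this example, not a function from any package:

```r
count_filters <- function(urls) {
  # Keep only the query string, then drop trailing "&page=..." style parameters
  query <- sub("^[^?]*\\?+", "", urls)
  query <- sub("&.*$", "", query)

  # Split on the encoded ":" first, so the delimiter is never confused with data
  tokens <- unlist(strsplit(query, "%3A", fixed = TRUE))

  # Assumption: "+" encodes a space; convert it, then percent-decode each token
  tokens <- gsub("+", " ", tokens, fixed = TRUE)
  tokens <- vapply(tokens, utils::URLdecode, character(1), USE.NAMES = FALSE)

  # Tokens alternate parameter, value
  stopifnot(length(tokens) %% 2 == 0)
  pairs <- data.frame(
    parameter = tokens[seq(1, length(tokens), by = 2)],
    value     = tokens[seq(2, length(tokens), by = 2)],
    stringsAsFactors = FALSE
  )

  # Count each parameter/value pair and sort by descending frequency
  pairs$frequency <- 1L
  counts <- aggregate(frequency ~ parameter + value, data = pairs, FUN = sum)
  counts[order(-counts$frequency), ]
}

# Two URLs in the style of the question, one per encoding of the same brand
urls <- c(
  "https://www.example.com/dp/c/830216013?q=%3Arelevance%3Abrand%3AFLYING%20MACHINE%3Abrand%3AMUFTI",
  "https://www.example.com/dp/c/830216013?q=%3Arelevance%3Abrand%3AFLYING+MACHINE%3Aprice%3A300%2C10500&page=1&gridValue=4"
)
res <- count_filters(urls)
print(res)
```

Here both spellings collapse to "FLYING MACHINE", so the brand is counted twice, and the price value is no longer polluted by the trailing page and gridValue parameters.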