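I am currently learning regular expressions in R. I have a sample data frame df with product descriptions in col1, and I want to extract all the measurements from col1 and add them to col2. I tried the following:
library(stringr)
df$col2 <- str_extract(df$col1, '([[:digit:]]*\\/?\\.?\\-?[[:digit:]]+[[:space:]]+(in|ft)\\.[[:space:]]*x*)')
This gives the following result: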
    col1                                                            col2
 1  1/2 in. x 1/2 in. x 3/4 ft. Copper Pressure Cup                 1/2 in. x
 2  Ensemble 60 in. x 43-1/2 in. x 54-1/4 in. 3-piece               60 in. x
 3  2-3/4 in. x 4-1/2 in. Heavy-Duty                                3/4 in. x
 4  1/4-20 x 2 in. Forged Steel                                     2 in.
 5  1/2-Amp Slo-Blo GMA Fuse                                        <NA>
 6  3/4 in. x 12 in. x 24 in. White Thermally                       3/4 in. x
 7  12.0 oz. of weight                                              <NA>
 8  1.4 fl. oz. of liquid                                           <NA>
 9  14 gal. tall                                                    <NA>
10  Sahara Wood 47 in. Long x 12-1/8 in. Deep x 1-11/16 in. Height  47 in.
11  1/25 HP Cast Iron                                               <NA>
12  1/2 in., 3/4 in. and 1 in. PVC                                  1/2 in.
13  24-3/4 in. x 48-3/4 in. x 1-1/4 in. Faux Windsor                3/4 in. x
14  8 oz. -200 Pot Of Cream                                         <NA>
15  5/8 in. dia. x 25 ft. Water                                     5/8 in.
16  18.5 / 30.5 in. Brushed Nickel                                  30.5 in.
17  57-1/2 in. x 70-5/16 in. Semi-Framed                            1/2 in. x
18  2-1/4 HP Router                                                 <NA>
19  12-Volt Lithium-Ion Cordless 3/8 in.                            3/8 in.
20  12-Gauge 24-5/8 in. Strap                                       5/8 in.
21  7-3/4 in. Wigan Ceiling                                         3/4 in.
22  1 qt. B-I-N                                                     <NA>
23  3/8 in. O.D. x 1/4 in. NPTF                                     3/8 in.
24  2-1/2 in. Long x 5/8 in. Diameter Spring                        1/2 in.
25  1/4 x 3 in. Heat-Shrink                                         3 in.
26  4-White PVC End                                                 <NA>
27  41000 Series Non-Vented Range                                   <NA>
28  Revival 1-Spray 5-Katalyst Air                                  <NA>
29  180-Degree White Outdoor                                        <NA>
30  3/8 x 3 Hand Scraped                                            <NA>
31  67-Qt. Jug                                                      <NA>
32  35-77-7/8 in. White                                             7/8 in.
33  -16 tpi x 4 in. Stainless Steel                                 4 in.
34  3-21 degree Full                                                <NA>
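As far as I can tell, part of the reason each row shows only one measurement is that str_extract returns just the first match per string, while str_extract_all returns every match. Here is a minimal sketch of that difference on one row of the sample. The pattern is a simplified, illustrative one (an assumption for this example, not the full pattern from my attempt), and the " x " separator is also just an assumption:

```r
library(stringr)

# One row from the sample data
x <- "1/2 in. x 1/2 in. x 3/4 ft. Copper Pressure Cup"

# Simplified, illustrative pattern: a fraction followed by "in." or "ft."
pat <- "[[:digit:]]+/[[:digit:]]+[[:space:]]+(in|ft)\\."

# str_extract() stops after the first match in each string
first <- str_extract(x, pat)
first
# "1/2 in."

# str_extract_all() returns a list with one character vector per input string
all_matches <- str_extract_all(x, pat)[[1]]
all_matches
# "1/2 in." "1/2 in." "3/4 ft."

# Joining the matches row by row gives one string per row
joined <- sapply(str_extract_all(x, pat), paste, collapse = " x ")
joined
# "1/2 in. x 1/2 in. x 3/4 ft."
```

This only covers plain fractions; whole numbers, decimals, and mixed values such as 43-1/2 would still need a broader pattern.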
However, the result I would like to see is the following:
df
    col1                                                            col2
 1  1/2 in. x 1/2 in. x 3/4 ft. Copper Pressure Cup                 1/2 in. x 1/2 in. x 3/4 ft.
 2  Ensemble 60 in. x 43-1/2 in. x 54-1/4 in. 3-piece               60 in. x 43-1/2 in. x 54-1/4 in.
 3  2-3/4 in. x 4-1/2 in. Heavy-Duty                                2-3/4 in. x 4-1/2 in.
 4  1/4-20 x 2 in. Forged Steel                                     1/4-20 x 2 in.
 5  1/2-Amp Slo-Blo GMA Fuse                                        1/2-Amp
 6  3/4 in. x 12 in. x 24 in. White Thermally                       3/4 in. x 12 in. x 24 in.
 7  12.0 oz. of weight                                              12.0 oz.
 8  1.4 fl. oz. of liquid                                           1.4 fl. oz.
 9  14 gal. tall                                                    14 gal.
10  Sahara Wood 47 in. Long x 12-1/8 in. Deep x 1-11/16 in. Height  47 in. Long x 12-1/8 in. Deep x 1-11/16 in. Height
11  1/25 HP Cast Iron                                               1/25 HP
12  1/2 in., 3/4 in. and 1 in. PVC                                  1/2 in., 3/4 in. and 1 in.
13  24-3/4 in. x 48-3/4 in. x 1-1/4 in. Faux Windsor                24-3/4 in. x 48-3/4 in. x 1-1/4 in.
14  8 oz. -200 Pot Of Cream                                         8 oz.
15  5/8 in. dia. x 25 ft. Water                                     5/8 in. dia. x 25 ft.
16  18.5 / 30.5 in. Brushed Nickel                                  18.5 / 30.5 in.
17  57-1/2 in. x 70-5/16 in. Semi-Framed                            57-1/2 in. x 70-5/16 in.
18  2-1/4 HP Router                                                 2-1/4 HP
19  12-Volt Lithium-Ion Cordless 3/8 in.                            12-Volt 3/8 in.
20  12-Gauge 24-5/8 in. Strap                                       12-Gauge 24-5/8 in.
21  7-3/4 in. Wigan Ceiling                                         7-3/4 in.
22  1 qt. B-I-N                                                     1 qt.
23  3/8 in. O.D. x 1/4 in. NPTF                                     3/8 in. O.D. x 1/4 in.
24  2-1/2 in. Long x 5/8 in. Diameter Spring                        2-1/2 in. Long x 5/8 in. Diameter
25  1/4 x 3 in. Heat-Shrink                                         1/4 x 3 in.
26  4-White PVC End
27  41000 Series Non-Vented Range
28  Revival 1-Spray 5-Katalyst Air                                  1-Spray 5-Katalyst
29  180-Degree White Outdoor                                        180-Degree
30  3/8 x 3 Hand Scraped                                            3/8 x 3
31  67-Qt. Jug                                                      67-Qt.
32  35-77-7/8 in. White                                             35-77-7/8 in.
33  -16 tpi x 4 in. Stainless Steel                                 -16 tpi x 4 in.
34  3-21 degree Full                                                3-21 degree
I am not sure how to adjust my regular expression so that it handles all of these cases. I also tried splitting my regular expression across multiple lines, one per group of units, but that did not help much either.
Here is what I tried:
df$col2 = paste(str_extract_all(df$col1, '([[:digit:]]*\\.?\\/?[[:digit:]]+[[:space:]]+(in|ft|cu\\.[[:space:]]+ft)\\.[[:space:]]*[WHD]*[[:space:]]+x*[[:space:]]*)+'), collapse = ' ')
df$col2[is.na(df$col2)] <- paste(str_extract_all(df$col1[df$col2], '[[:digit:]]*\\.?\\/?[[:digit:]]+[[:space:]]*\\-*(oz|lb|gal|Gal)\\.'), collapse = ' ')
df$col2[is.na(df$col2)] <- paste(str_extract_all(df$col1[df$col2], '[[:digit:]]*\\.?\\,?[[:digit:]]+\\-?[[:space:]]*(Watt|Pack|Gauge|piece|Piece|Panel|mph|MPH|cc|Ton|ton|Light|Gang|LED|Volt|amp|BTU|Amp|Drawer|Step|Tier|Cycle)[[:space:]]+'), collapse = ' ')
df$col2[is.na(df$col2)] <- paste(str_extract_all(df$col1[df$col2], '[[:digit:]]*\\.?\\,?[[:digit:]]+[[:space:]]+sq\\.[[:space:]]+ft\\.'), collapse = ' ')
However, I did not get the result I want (the df output shown above).
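One thing I suspect about the multi-step attempt (an observation, not a confirmed diagnosis): str_extract_all returns a list, and calling paste on that list with collapse produces a single string for the whole column instead of one value per row. The sketch below illustrates this on three made-up strings with a simplified, illustrative pattern; both the strings and the pattern are assumptions for the example only.

```r
library(stringr)

# Hypothetical inputs in the style of the sample data
x <- c("12.0 oz. of weight", "8 oz. and 14 gal.", "no units here")

# Simplified, illustrative pattern for oz./gal. quantities
pat <- "[[:digit:]]+\\.?[[:digit:]]*[[:space:]]*(oz|gal)\\."

# A list with one character vector of matches per input string
m <- str_extract_all(x, pat)

# paste() on the list itself, with collapse, yields ONE string for all rows
whole_column <- paste(m, collapse = " ")
length(whole_column)
# 1

# Collapsing inside each list element keeps one value per row
per_row <- vapply(m, paste, character(1), collapse = " ")
per_row[per_row == ""] <- NA   # rows without any match become NA instead of ""
per_row
# "12.0 oz."  "8 oz. 14 gal."  NA
```

With a per-row result like this, a later step can fill only the rows where col2 is still NA, which seems to be what the is.na(df$col2) assignments were meant to do.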
Do you have any suggestions?
Thanks!