Question

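I have a list of addresses that typically look like this:

1000 Currie AV Apt: Minneapolis MN 55403

1843 Polk ST NE Apt: b

1801 3 AV S Apt: 203 Minneapolis MN 55404

2900 Thomas AV S Apt: 1618 MPLS MN 55416

8409 Elliott AV S Apt: Bloomington MN 55420

I am new to regular expressions.

I would like to replace Apt: and all the text up to the first capital letter with a blank.

The code I am currently trying is: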

generate address_home = regexr(address_home1, "(Apt:).*?([A-Z])", " ")

Answer 1

Regex:

Apt:[^A-Z\n]*

Replace the matched characters with a single space.


I think your code would be:

gen address_home = regexr(address_home1, "Apt:[^A-Z\n]*", " ")

or

gen address_home = regexr(address_home1, "Apt:[^A-Z\\n]*", " ")

(I don't know whether you need to escape the backslash again.)
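To see what this pattern removes, here is a minimal sketch in Python rather than Stata. That choice is an assumption made purely for illustration: Python's re module is not the engine behind regexr(), and the helper name strip_apt is hypothetical.

```python
import re

# Pattern from the answer: "Apt:" followed by any run of characters
# that are neither uppercase ASCII letters nor newlines.
APT_PATTERN = re.compile(r"Apt:[^A-Z\n]*")


def strip_apt(address: str) -> str:
    """Replace the first 'Apt:...' run with a single space (hypothetical helper)."""
    return APT_PATTERN.sub(" ", address, count=1)


if __name__ == "__main__":
    for adr in [
        "1000 Currie AV Apt: Minneapolis MN 55403",
        "1843 Polk ST NE Apt: b",
        "1801 3 AV S Apt: 203 Minneapolis MN 55404",
    ]:
        # repr() makes doubled and trailing spaces visible
        print(repr(strip_apt(adr)))
```

Because the space before Apt: is kept and the match is replaced by another space, the result contains two consecutive spaces. In Stata, wrapping the call in itrim() should collapse them.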

Answer 2

Try this (a substitution):

s/Apt:.*?(?=[A-Z])//g

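This works in languages that use Perl or PCRE regular expressions.

s/// is the basic substitution skeleton.
Apt: matches the literal text "Apt:".
.*? matches any characters, as few as possible (non-greedy).
(?=[A-Z]) is a lookahead: it requires an uppercase letter to come next but leaves that letter out of the match.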
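The pieces above can be tried out in any engine that supports lookaheads. The following is a small sketch using Python's re module, chosen here only as an example of such an engine; the function name drop_apt is hypothetical.

```python
import re

# "Apt:" followed by as few characters as possible, up to (but not
# including) the next uppercase ASCII letter. The lookahead (?=[A-Z])
# checks for that letter without consuming it.
LOOKAHEAD_PATTERN = re.compile(r"Apt:.*?(?=[A-Z])")


def drop_apt(address: str) -> str:
    """Delete every 'Apt:...' run that is followed by an uppercase letter."""
    return LOOKAHEAD_PATTERN.sub("", address)


if __name__ == "__main__":
    for adr in [
        "1000 Currie AV Apt: Minneapolis MN 55403",
        "2900 Thomas AV S Apt: 1618 MPLS MN 55416",
        "1843 Polk ST NE Apt: b",
    ]:
        print(repr(drop_apt(adr)))
```

Note that an address such as "1843 Polk ST NE Apt: b" has no uppercase letter after Apt:, so the lookahead fails and the string is returned unchanged.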

Answer 3

I think your regular expression should look like this:

.*(Apt:.*?)([A-Z]).*

And your code would look like this:

regexr(address_home1, ".*(Apt:.*?)([A-Z]).*", " ")

Answer 4

Stata's regular expressions are not very sophisticated, and I am not a regex expert, but this will get you close:

clear
set more off

*----- example data set -----

input ///
str30 adr
"1000 Currie AV Apt: Minneapolis MN 55403"
"1843 Polk ST NE Apt: b"
"1801 3 AV S Apt: 203 Minneapolis MN 55404"
"2900 Thomas AV S Apt: 1618 MPLS MN 55416"
"8409 Elliott AV S Apt: Bloomington MN 55420"
end

list

*----- what you want -----

gen adr2 = itrim(regexr(adr, "(Apt: *)([a-z0-9]*)", ""))

list

which results in:

. list

     +------------------------------------------------------------+
     |                            adr                        adr2 |
     |------------------------------------------------------------|
  1. | 1000 Currie AV Apt: Minneapoli   1000 Currie AV Minneapoli |
  2. |         1843 Polk ST NE Apt: b            1843 Polk ST NE  |
  3. | 1801 3 AV S Apt: 203 Minneapol       1801 3 AV S Minneapol |
  4. | 2900 Thomas AV S Apt: 1618 MPL        2900 Thomas AV S MPL |
  5. | 8409 Elliott AV S Apt: Bloomin   8409 Elliott AV S Bloomin |
     +------------------------------------------------------------+

If needed, you can use other string functions, such as trim(). See help string functions.
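For readers who want to check the logic outside Stata, the same two steps can be sketched in Python. This is an approximation under stated assumptions: re.sub with count=1 stands in for regexr(), which replaces only the first match, and a second re.sub stands in for itrim(). The function name remove_apt_unit is hypothetical.

```python
import re


def remove_apt_unit(address: str) -> str:
    """Drop 'Apt:' plus a lowercase/numeric unit, then collapse repeated spaces."""
    # Stand-in for regexr(): replace only the first match of the pattern.
    without_apt = re.sub(r"Apt: *[a-z0-9]*", "", address, count=1)
    # Stand-in for itrim(): collapse runs of two or more spaces into one.
    return re.sub(r" {2,}", " ", without_apt)


if __name__ == "__main__":
    for adr in [
        "1000 Currie AV Apt: Minneapolis MN 55403",
        "1843 Polk ST NE Apt: b",
        "1801 3 AV S Apt: 203 Minneapolis MN 55404",
    ]:
        print(repr(remove_apt_unit(adr)))
```

As in the Stata listing, the second address keeps a single trailing space, which is the kind of leftover trim() would remove.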

Answer 5

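Regular expressions are always useful, but the OP may not need them here. In this particular case, a combination of the functions strpos() and substr() will mostly do the job.

For example: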

. clear 

input str50 adr
"1000 Currie AV Apt: Minneapolis MN 55403"
"1843 Polk ST NE Apt: b"
"1801 3 AV S Apt: 203 Minneapolis MN 55404"
"2900 Thomas AV S Apt: 1618 MPLS MN 55416"
"8409 Elliott AV S Apt: Bloomington MN 55420"
end


. generate adr2 =  substr(adr, 1, strpos(adr, ":") - 5) + ///
                   substr(adr, strpos(adr, ":") + 1, .)

. list

   +--------------------------------------------------------------------------------------+
   |                                         adr                                     adr2 |
   |--------------------------------------------------------------------------------------|
1. |    1000 Currie AV Apt: Minneapolis MN 55403      1000 Currie AV Minneapolis MN 55403 |
2. |                      1843 Polk ST NE Apt: b                        1843 Polk ST NE b |
3. |   1801 3 AV S Apt: 203 Minneapolis MN 55404     1801 3 AV S 203 Minneapolis MN 55404 |
4. |    2900 Thomas AV S Apt: 1618 MPLS MN 55416      2900 Thomas AV S 1618 MPLS MN 55416 |
5. | 8409 Elliott AV S Apt: Bloomington MN 55420   8409 Elliott AV S Bloomington MN 55420 |
   +--------------------------------------------------------------------------------------+

The idea is to use the colon (:) as a reference point for removing the substring Apt: from each address, since its length is always constant.
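The index arithmetic can be easy to get wrong, so here is the same idea sketched in Python, where positions are 0-based instead of Stata's 1-based strpos(). The function name cut_apt and the guard for a missing colon are additions for illustration, not part of the Stata code above.

```python
def cut_apt(address: str) -> str:
    """Remove ' Apt:' by slicing around the first colon."""
    colon = address.find(":")  # 0-based index, -1 if there is no colon
    if colon == -1:
        # Added guard (assumption): leave addresses without a colon alone.
        return address
    # " Apt" is the 4 characters before the colon; skip them and the colon.
    return address[: colon - 4] + address[colon + 1:]


if __name__ == "__main__":
    for adr in [
        "1000 Currie AV Apt: Minneapolis MN 55403",
        "1843 Polk ST NE Apt: b",
    ]:
        print(repr(cut_apt(adr)))
```

Unlike the regex-based answers, this keeps the unit (b, 203, 1618) in the result, which matches the Stata listing above.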

Edit

@Nick Cox offers a similar but more concise solution:

generate adr3 = subinstr(adr, "Apt: ", "", .)

This simply replaces every instance of "Apt: " with the empty string "".

Replace text between two strings
