Question

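I am trying to return user location data while scraping Twitter. I am using a regular expression and, in particular, I want to exclude "\n" from the output.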

Current regular expression:

import re

data = open("user_locations.txt", "r")
valid_ex = re.compile(r'([A-Z][a-z]+), ([A-Za-z]+[^\n])')

user_locations.txt:

California, USA
You are your own ExclusiveLogo
Around The World
Galatasaray
★DM 4 PROMO / CONTENT REMOVAL★
Glasgow, Scotland
United States
Berlin, Germany
Global

Expected output:

['California, USA', 'Glasgow, Scotland', 'Berlin, Germany']

Actual output:

['California, USA\n', 'Glasgow, Scotland\n', 'Berlin, Germany\n']

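Another possible cause of the difference between the expected and actual output is the way I use search() when building and printing the list. That is: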

locations_list = []
for line in data:
    result = valid_ex.search(line)
    if result:
        locations_list.append(line)
    print(locations_list)

Thanks, any help would be greatly appreciated! :)

Answer 1

When a match is found, you call locations_list.append(line). This appends the entire line (including the newline), not just the text that matched.

Here are a few options that give the desired result:

Option 1

Change locations_list.append(line) to locations_list.append(line.strip()).
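
As a minimal sketch of Option 1, the loop from the question could be rewritten as below. The sample lines are inlined with io.StringIO as a stand-in for user_locations.txt (an assumption, made only to keep the example self-contained); the pattern is the one from the question.

```python
import io
import re

valid_ex = re.compile(r'([A-Z][a-z]+), ([A-Za-z]+[^\n])')

# Stand-in for open("user_locations.txt"): iterating yields lines ending in "\n".
data = io.StringIO(
    "California, USA\n"
    "Around The World\n"
    "Glasgow, Scotland\n"
    "United States\n"
    "Berlin, Germany\n"
    "Global\n"
)

locations_list = []
for line in data:
    if valid_ex.search(line):
        # strip() removes the trailing newline (and any surrounding whitespace).
        locations_list.append(line.strip())

# Print once, after the loop, rather than on every iteration.
print(locations_list)
```

With this sample input the list should contain the three "City, Country" entries without trailing newlines.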

Option 2

Collect the matched text itself instead of the whole line:

import re

with open('test.txt') as f:
    # findall returns only the matched text, so no newline is included
    print(re.findall(r'[A-Z][a-z]+, [A-Za-z]+', f.read()))

Output:

['California, USA', 'Glasgow, Scotland', 'Berlin, Germany']
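
If you would rather keep the line-by-line loop, a related sketch is to append the matched text via result.group(0) instead of the whole line. The helper name extract_locations and the sample list are hypothetical, introduced only for illustration.

```python
import re

valid_ex = re.compile(r'[A-Z][a-z]+, [A-Za-z]+')


def extract_locations(lines):
    """Return only the matched 'City, Country' text from each line."""
    found = []
    for line in lines:
        result = valid_ex.search(line)
        if result:
            # group(0) is the matched substring, so the trailing "\n" is not included.
            found.append(result.group(0))
    return found


sample = ["California, USA\n", "Global\n", "Berlin, Germany\n"]
print(extract_locations(sample))
```

One thing to keep in mind with this approach: only the matched portion is kept, so a line such as "New York, USA" would yield "York, USA" with this pattern, whereas Option 1 keeps the full line.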

Answer 2

Have you considered using str.strip() to remove the trailing newline?
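
For illustration, a small sketch of how str.strip() compares with str.rstrip("\n"); the sample string is made up.

```python
line = "  Glasgow, Scotland \n"

# strip() removes all leading and trailing whitespace, including the newline.
stripped = line.strip()

# rstrip("\n") removes only trailing newline characters and leaves other whitespace.
newline_only = line.rstrip("\n")

print(repr(stripped))
print(repr(newline_only))
```

Which one fits depends on whether the surrounding spaces in a location string matter to you.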

Answer 3

A simple solution is to replace every run of consecutive whitespace characters with a single space.

text = re.sub(r'\s+', ' ', text)
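
A minimal sketch of this substitution on a made-up string. Note that a trailing "\n" becomes a trailing space, and that running it over a whole file would also turn the newlines between entries into spaces, so this example assumes it is applied to one line at a time and followed by strip().

```python
import re

line = "Berlin,   Germany\n"

# Collapse each run of whitespace (spaces, tabs, newlines) into a single space.
text = re.sub(r'\s+', ' ', line)

# The trailing "\n" has become a trailing space, so strip it off.
cleaned = text.strip()

print(repr(text))
print(repr(cleaned))
```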

Excluding \n when reading input from a file
