Question

I am trying to clean up data in a CSV table that looks like this:

KATY PERRY@katyperry
1,084,149,282,038,820
Justin Bieber@justinbieber
10,527,300,631,674,900,000
Barack Obama@BarackObama
9,959,243,562,511,110,000

I want to extract the "@" handles, for example:

@katyperry
@justinbieber
@BarackObama

Here is the code I came up with, but all it does is repeat the second row of the table over and over:

import csv
import re
with open('C:\\Users\\TK\\Steemit\\Scripts\\twitter.csv', 'rt',  encoding='UTF-8') as inp:
    read = csv.reader(inp)
    for row in read:
        for i in row:
            if i.isalpha():
                stringafterword = re.split('\\@\\',row)[-1]
        print(stringafterword)

Answer 1

If you are willing to use re, you can get the list of strings in one line:

import re

#content string added to make it a working example
content = """KATY PERRY@katyperry
1,084,149,282,038,820
Justin Bieber@justinbieber
10,527,300,631,674,900,000
Barack Obama@BarackObama
9,959,243,562,511,110,000"""

#solution using 're':
m = re.findall('@.*', content)
print(m)

#option without 're' but using string.find() based on your loop:
for row in content.splitlines():
    pos_of_at = row.find('@')
    if pos_of_at > -1: #-1 indicates "substring not found"
        print(row[pos_of_at:])

You should of course replace the content string with the contents of your file.
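As a rough sketch of that last step, the file can be read into one string and passed to re.findall. The helper names extract_handles and extract_handles_from_file are made up for this illustration, and the pattern @\w+ assumes handles consist only of letters, digits and underscores; the demo writes a small temporary file so it runs without the real twitter.csv.

```python
import re
import tempfile
from pathlib import Path


def extract_handles(text):
    """Return every @handle found in text, in order of appearance."""
    # \w+ stops at the end of the handle (assumption: letters, digits, underscore only)
    return re.findall(r'@\w+', text)


def extract_handles_from_file(path, encoding='UTF-8'):
    """Read the whole file as one string and extract the handles from it."""
    return extract_handles(Path(path).read_text(encoding=encoding))


# Demo data copied from the question, written to a throwaway file
sample = (
    "KATY PERRY@katyperry\n"
    "1,084,149,282,038,820\n"
    "Justin Bieber@justinbieber\n"
    "10,527,300,631,674,900,000\n"
)

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "twitter.csv"
    csv_path.write_text(sample, encoding="UTF-8")
    handles = extract_handles_from_file(csv_path)

print(handles)
```

For your own data, pass the path of your CSV file to extract_handles_from_file instead of the temporary one.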

Answer 2

First, "@" is a symbol, not a letter, so if i.isalpha(): returns False for any cell that contains it. Your re.split() is never even called.
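A small check makes this visible. The sketch below feeds two lines from the question through csv.reader (using io.StringIO in place of the real file, which is an assumption made so the example is self-contained) and tests each cell with isalpha().

```python
import csv
import io

# Two sample lines from the question, read from memory instead of a file
sample = "KATY PERRY@katyperry\n1,084,149,282,038,820\n"
rows = list(csv.reader(io.StringIO(sample)))

# The name line is a single cell; the number line is split at every comma
print(rows[0])
print(rows[1])

# isalpha() is True only if every character in the string is a letter
alpha_flags = [cell.isalpha() for row in rows for cell in row]
print(alpha_flags)
```

No cell passes the test: the first one contains a space and "@", and the others are digits, so the branch guarded by isalpha() never runs.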

Try this:

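import csv
import re

with open('C:\\Users\\input.csv', 'rt', encoding='UTF-8') as inp:
    read = csv.reader(inp)
    for row in read:
        for i in row:
            # findall returns a list of every "@..." match in this cell
            stringafterword = re.findall('@.*', i)
            # print inside the inner loop, and only for cells that contain a handle
            if stringafterword:
                print(stringafterword[0])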

Here I removed the isalpha() check and used re.findall() instead of re.split(), since the part from "@" onward is what you want.

Hope it works.

Extracting a substring from a CSV table
