Question

我的csv编写器目前没有逐行生成它只是乱七八糟。任何帮助都会很棒，基本上我需要csv在下面的一个colulmn中的产量部分有4行。

tweets_df=tweets_df.dropna()

 for i in tweets_df.ix[:,0]:
    if regex_getter(i) != None:
        print(regex_getter(i))

产量

Burlington, VT
Minneapolis, MN
Bloomington, IN
Irvine, CA


 with open('Bernie.csv', 'w') as mycsvfile: 
for i in tweets_df.ix[:,0]:
if regex_getter(i) != None:
    row = regex_getter(i)
    writer.writerow([i])




def regex_getter(entry):

txt = entry

re1='((?:[a-z][a-z]+))' # Word 1
re2='(,)'   # Any Single Character 1
re3='(\\s+)'    # White Space 1
re4='((?:(?:AL)|(?:AK)|(?:AS)|(?:AZ)|(?:AR)|(?:CA)|(?:CO)|(?:CT)|(?:DE)|(?:DC)|(?:FM)|(?:FL)|(?:GA)|(?:GU)|(?:HI)|(?:ID)|(?:IL)|(?:IN)|(?:IA)|(?:KS)|(?:KY)|(?:LA)|(?:ME)|(?:MH)|(?:MD)|(?:MA)|(?:MI)|(?:MN)|(?:MS)|(?:MO)|(?:MT)|(?:NE)|(?:NV)|(?:NH)|(?:NJ)|(?:NM)|(?:NY)|(?:NC)|(?:ND)|(?:MP)|(?:OH)|(?:OK)|(?:OR)|(?:PW)|(?:PA)|(?:PR)|(?:RI)|(?:SC)|(?:SD)|(?:TN)|(?:TX)|(?:UT)|(?:VT)|(?:VI)|(?:VA)|(?:WA)|(?:WV)|(?:WI)|(?:WY)))(?![a-z])'   # US State 1

rg = re.compile(re1+re2+re3+re4,re.IGNORECASE|re.DOTALL)
m = rg.search(txt)
if m:
    word1=m.group(1)
    c1=m.group(2)
    ws1=m.group(3)
    usstate1=m.group(4)
    return str((word1 +  c1 +ws1 + usstate1))

我的数据看起来没有正则表达式方法，它基本上取出了所有非City，State格式的数据。例如，它排除了所有不像罗利，北卡罗来纳州的东西。

for i in tweets_df.ix[:,0]:
     print(i)


Indiana, USA
Burlington, VT
United States
Saint Paul - Minneapolis, MN
Inland Valley, The Pass, S. CA
In the Dreamatorium
Nova Scotia;Canada
North Carolina, USA
INTP. West Michigan
Los Angeles, California
Waterbury Connecticut
Right side of the tracks

Answer 1

我会这样做：

states =  {
        'AK': 'Alaska',
        'AL': 'Alabama',
        'AR': 'Arkansas',
        'AS': 'American Samoa',
        'AZ': 'Arizona',
        'CA': 'California',
        'CO': 'Colorado',
        'CT': 'Connecticut',
        'DC': 'District of Columbia',
        'DE': 'Delaware',
        'FL': 'Florida',
        'GA': 'Georgia',
        'GU': 'Guam',
        'HI': 'Hawaii',
        'IA': 'Iowa',
        'ID': 'Idaho',
        'IL': 'Illinois',
        'IN': 'Indiana',
        'KS': 'Kansas',
        'KY': 'Kentucky',
        'LA': 'Louisiana',
        'MA': 'Massachusetts',
        'MD': 'Maryland',
        'ME': 'Maine',
        'MI': 'Michigan',
        'MN': 'Minnesota',
        'MO': 'Missouri',
        'MP': 'Northern Mariana Islands',
        'MS': 'Mississippi',
        'MT': 'Montana',
        'NA': 'National',
        'NC': 'North Carolina',
        'ND': 'North Dakota',
        'NE': 'Nebraska',
        'NH': 'New Hampshire',
        'NJ': 'New Jersey',
        'NM': 'New Mexico',
        'NV': 'Nevada',
        'NY': 'New York',
        'OH': 'Ohio',
        'OK': 'Oklahoma',
        'OR': 'Oregon',
        'PA': 'Pennsylvania',
        'PR': 'Puerto Rico',
        'RI': 'Rhode Island',
        'SC': 'South Carolina',
        'SD': 'South Dakota',
        'TN': 'Tennessee',
        'TX': 'Texas',
        'UT': 'Utah',
        'VA': 'Virginia',
        'VI': 'Virgin Islands',
        'VT': 'Vermont',
        'WA': 'Washington',
        'WI': 'Wisconsin',
        'WV': 'West Virginia',
        'WY': 'Wyoming'
}

# sample DF
data = """\
location
Indiana, USA
Burlington, VT
United States
Saint Paul - Minneapolis, MN
Inland Valley, The Pass, S. CA
In the Dreamatorium
Nova Scotia;Canada
North Carolina, USA
INTP. West Michigan
Los Angeles, California
Waterbury Connecticut
Right side of the tracks
"""
df = pd.read_csv(io.StringIO(data), sep=r'\|')

re_states = r'.*,\s*(?:' + '|'.join(states.keys()) + ')'

df.loc[df.location.str.contains(re_states), 'location'].to_csv('filtered.csv', index=False)

说明：

In [3]: df
Out[3]:
                          location
0                     Indiana, USA
1                   Burlington, VT
2                    United States
3     Saint Paul - Minneapolis, MN
4   Inland Valley, The Pass, S. CA
5              In the Dreamatorium
6               Nova Scotia;Canada
7              North Carolina, USA
8              INTP. West Michigan
9          Los Angeles, California
10           Waterbury Connecticut
11        Right side of the tracks

生成RegEx：

In [9]: re_states
Out[9]: '.*,\\s*(?:VA|AK|ND|CA|CO|AR|MD|DC|KY|LA|OR|VT|IL|CT|OH|GA|WA|AS|NC|MN|NH|ID|HI|NA|MA|MS|WV|VI|FL|MO|MI|AL|ME|GU|NM|SD|WY|AZ|MP|DE|RI|PA|
NJ|WI|OK|TN|TX|KS|IN|NV|NY|NE|PR|UT|IA|MT|SC)'

搜索掩码：

In [10]: df.location.str.contains(re_states)
Out[10]:
0     False
1      True
2     False
3      True
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
Name: location, dtype: bool

Filtered DF：

In [11]: df.loc[df.location.str.contains(re_states)]
Out[11]:
                       location
1                Burlington, VT
3  Saint Paul - Minneapolis, MN

现在只需将其假脱机为CSV：

df.loc[df.location.str.contains(re_states), 'location'].to_csv('d:/temp/filtered.csv', index=False)

filtered.csv：

"Burlington, VT"
"Saint Paul - Minneapolis, MN"

<强>更新

从Pandas 0.20.1 the .ix indexer is deprecated, in favor of the more strict .iloc and .loc indexers开始。

在python中写入CSV

1 个答案: