I am working on a task that involves two CSV files. The first file contains my entire data set (shown below):
0 ID Name Suburb State Postcode Email Lat Lon
0 0 1 Hurstville Clinic Hurstville NSW 1493 hurstville@myclinic.com.au -33.975869 151.088939
1 1 2 Sydney Centre Clinic Sydney NSW 2000 sydney@myclinic.com.au -33.867139 151.207114
2 2 3 Auburn Clinic Auburn NSW 2144 auburn@myclinic.com.au -33.849322 151.033421
3 3 4 Riverwood Clinic Riverwood NSW 2210 riverwood@myclinic.com.au -33.949859 151.052469
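The second file contains the data that must replace the Email column of the first file.
I used a regex to convert the second file into HTML links.
This is what I did to clean the data: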
def clean(filename):
    df = pd.read_csv(filename)
    df['Email'] = df['Email'].apply(lambda x: x if '@' in str(x) else str(x)+'@myclinic.com.au')
    return df.to_csv('temp1.csv')
Email
<a href="hurstville@myclinic.com.au"></a>
<a href="sydney@myclinic.com.au"></a>
<a href="auburn@myclinic.com.au"></a>
<a href="riverwood@myclinic.com.au"></a>
<a href="bay@myclinic.com.au"></a>
<a href="harrington@myclinic.com.au"></a>
<a href="forest@myclinic.com.au"></a>
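This is incorrect. The function above drops everything before a space in the Email column, and it also skips any row where the Email column has a space before the @.
This is what I am trying to do:
a) Clean the data in file one. It has two unwanted columns, and the Email column contains spaces, because some addresses have a space in the name and the final function cannot read them.
b) Fix the final output, which is not the output I want. It omits 10 rows: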
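One possible source of the unwanted columns in (a) is that to_csv writes the row index by default, so every save and reload adds another unnamed column. The sketch below uses a made-up two-row sample (not the real file) to show the difference that index=False makes; whether this is the actual cause here is an assumption.

```python
import io
import pandas as pd

# Hypothetical two-row sample standing in for file one.
raw = (
    "ID,Name,Email\n"
    "1,Hurstville Clinic,hurstville\n"
    "2,Sydney Centre Clinic,sydney@myclinic.com.au\n"
)
df = pd.read_csv(io.StringIO(raw))

# Default behaviour: the row index is written as an extra, unnamed first column.
with_index = df.to_csv()
# index=False writes only the real columns.
without_index = df.to_csv(index=False)

reread_with = pd.read_csv(io.StringIO(with_index))
reread_without = pd.read_csv(io.StringIO(without_index))

print(list(reread_with.columns))     # includes an 'Unnamed: 0' column
print(list(reread_without.columns))  # only ID, Name, Email
```

If the extra columns are already in the file on disk, they can be dropped after loading with df.drop(columns=[...]), naming whichever columns are not needed.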
This is what I did for file2:
emails = re.findall(r'\S+@\S+', text)
for x in range(0, len(emails)):
    emails[x] = '<a href="%s"></a>' % emails[x]
emails.insert(0, 'Email')
with open(csvfile, "w") as output:
    writer = csv.writer(output, lineterminator='\n')
    for val in emails:
        writer.writerow([val])
Here, text is a dictionary containing my entire CSV data. To get the contents of the CSV file, I simply pasted the whole thing into my Python file.
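To illustrate how the pattern \S+@\S+ behaves when an address contains a space, here is a small sketch. The sample addresses are invented for illustration and are not taken from the real file.

```python
import re

pattern = r'\S+@\S+'  # the pattern used in the question

# Hypothetical addresses (assumed, not from the real data).
samples = [
    'hurstville@myclinic.com.au',    # no space
    'north sydney@myclinic.com.au',  # space inside the part before the @
    'bay @myclinic.com.au',          # space directly before the @
]

# \S matches only non-whitespace, so a match can never span a space.
results = {s: re.findall(pattern, s) for s in samples}
for s, found in results.items():
    print(repr(s), '->', found)
```

The second sample loses everything before the space, and the third yields no match at all because nothing non-blank sits directly before the @. Both effects would leave file2 with addresses that no longer match the Email values in file one.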
ID Name Suburb State Postcode Email_Str Lat Lon
0 1 Hurstville Clinic Hurstville NSW 1493 <a href="hurstville@myclinic.com.au"></a> -33.975869 151.088939
1 2 Sydney Centre Clinic Sydney NSW 2000 <a href="sydney@myclinic.com.au"></a> -33.867139 151.207114
2 3 Auburn Clinic Auburn NSW 2144 <a href="auburn@myclinic.com.au"></a> -33.849322 151.033421
3 4 Riverwood Clinic Riverwood NSW 2210 <a href="riverwood@myclinic.com.au"></a> -33.949859 151.052469
4 6 Harrington Clinic Harrington NSW 2427 <a href="harrington@myclinic.com.au"></a> -31.872153 152.689811
5 9 Benolong Clinic Benolong NSW 2830 <a href="benolong@myclinic.com.au"></a> -32.413736 148.63938
6 11 Preston Clinic Preston VIC 3072 <a href="preston@myclinic.com.au"></a> -37.738736 145.000515
7 13 Douglas Clinic Douglas VIC 3409 <a href="douglas@myclinic.com.au"></a> -37.842988 144.892631
8 14 Mildura Clinic Mildura VIC 3500 <a href="mildura@myclinic.com.au"></a> -34.181714 142.163072
9 15 Broadford Clinic Broadford VIC 3658 <a href="broadford@myclinic.com.au"></a> -37.203001 145.050171
10 16 Officer Clinic Officer VIC 3809 <a href="officer@myclinic.com.au"></a> -38.063056 145.40958
11 18 Langsborough Clinic Langsborough VIC 3971 <a href="langsborough@myclinic.com.au"></a> -38.651487 146.675098
12 19 Brisbane Centre Clinic Brisbane QLD 4000 <a href="brisbane@myclinic.com.au"></a> -27.46758 153.027892
13 20 Robertson Clinic Robertson QLD 4109 <a href="robertson@myclinic.com.au"></a> -27.565733 153.057213
14 22 Ipswich Clinic Ipswich QLD 4305 <a href="ipswich@myclinic.com.au"></a> -27.614604 152.760876
15 24 Caboolture Clinic Caboolture QLD 4510 <a href="caboolture@myclinic.com.au"></a> -27.085007 152.951707
16 25 Booie Clinic Booie QLD 4610 <a href="booie@myclinic.com.au"></a> -26.498426 151.935421
17 26 Rockhampton Clinic Rockhampton QLD 4700 <a href="rockhampton@myclinic.com.au"></a> -23.378941 150.512323
18 28 Cairns Clinic Cairns QLD 4870 <a href="cairns@myclinic.com.au"></a> -16.925397 145.775178
19 29 Adelaide Centre Clinic Adelaide SA 5000 <a href="adelaide@myclinic.com.au"></a> -34.92577 138.599732
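After the final merge, part of my data is lost.
As you can see, a lot of rows are missing. Please help me.
Answer 0 (score: 0)
I am not sure what the problem is, but it looks like you are trying to turn the email text into email links. You can do it like this: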
df['Email'] = df['Email'].apply(lambda x: '<a href="' + x + '"></a>')
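Answer 1 (score: 0)
It looks like you are trying to merge the two DataFrames on the Email column, so that after the merge the email string from DF2 is attached to the matching row of DF1.
Working code below (using your data):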
import pandas as pd
import re
df1 = pd.read_csv('file1.txt', sep=",", engine="python")
df2 = pd.read_csv('file2.txt', sep=",", engine="python")
def get_email(x):
    return ''.join(re.findall(r'"([^"]*)"', x))
df2.columns = ['Email_Str']
df2['Email']=df2['Email_Str'].apply(get_email)
df2 = df2[['Email','Email_Str']]
df3=pd.merge(df1,df2,on='Email').drop(['Email'], axis=1)
df3 = df3[[u'ID', u'Name', u'Suburb', u'State', u'Postcode',
           u'Email_Str', u'Lat', u'Lon']]
Result:
>>> df3
ID Name Suburb State Postcode \
0 1 Hurstville Clinic Hurstville NSW 1493
1 2 Sydney Centre Clinic Sydney NSW 2000
2 3 Auburn Clinic Auburn NSW 2144
3 4 Riverwood Clinic Riverwood NSW 2210
Email_Str Lat Lon
0 <a href="hurstville@myclinic.com.au"></a> -33.975869 151.088939
1 <a href="sydney@myclinic.com.au"></a> -33.867139 151.207114
2 <a href="auburn@myclinic.com.au"></a> -33.849322 151.033421
3 <a href="riverwood@myclinic.com.au"></a> -33.949859 151.052469
>>>
Hope this is what you are looking for.
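Note that pd.merge uses an inner join by default, so any row of DF1 whose Email has no exact match in DF2 is dropped silently. A rough way to find those rows is a left merge with indicator=True, sketched here on invented miniature frames (the stray space in the second address is an assumption about the kind of mismatch involved).

```python
import pandas as pd

# Hypothetical miniature frames; the second Email in df1 contains a space.
df1 = pd.DataFrame({
    'ID': [1, 2, 3],
    'Email': [
        'hurstville@myclinic.com.au',
        'north sydney@myclinic.com.au',
        'auburn@myclinic.com.au',
    ],
})
df2 = pd.DataFrame({
    'Email': [
        'hurstville@myclinic.com.au',
        'sydney@myclinic.com.au',
        'auburn@myclinic.com.au',
    ],
})
df2['Email_Str'] = '<a href="' + df2['Email'] + '"></a>'

# Default inner join: unmatched rows disappear without warning.
inner = pd.merge(df1, df2, on='Email')

# Left join with an indicator column keeps every df1 row and labels the misses.
checked = pd.merge(df1, df2, on='Email', how='left', indicator=True)
unmatched = checked[checked['_merge'] == 'left_only']

print(len(inner))
print(unmatched[['ID', 'Email']])
```

Rows listed in unmatched are the ones an inner merge would have lost, which makes it easier to see which Email values need cleaning.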
Answer 2 (score: 0)
I figured out what the problem is. My original data has spaces in the Email column. Can anyone update my regex function to handle that?
def clean(filename):
    df = pd.read_csv(filename)
    df['Email'] = df['Email'].apply(lambda x: x if '@' in str(x) else str(x)+'@myclinic.com.au')
    return df.to_csv('temp1.csv')
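One possible way to handle the spaces is to strip all whitespace from each value before checking for the @. This is only a sketch: normalize_email is a hypothetical helper name, and removing the spaces (rather than, say, replacing them with another character) is an assumption about what the addresses should look like.

```python
import re

DOMAIN = 'myclinic.com.au'  # domain as it appears in the question's data


def normalize_email(value, domain=DOMAIN):
    """Remove all whitespace, then append the domain if no @ is present."""
    text = re.sub(r'\s+', '', str(value))
    return text if '@' in text else text + '@' + domain


# Hypothetical inputs, not taken from the real file.
print(normalize_email('north sydney@myclinic.com.au'))
print(normalize_email('bay @myclinic.com.au'))
print(normalize_email('forest'))
```

Inside clean, this could be applied with df['Email'] = df['Email'].apply(normalize_email). The same normalization would need to be applied to the addresses in the second file so the merge keys still match. Missing values are not handled here: str() turns a NaN into the text 'nan', so empty cells should be filtered or filled first.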