I have a csv file:
ID,"address","used_at","active_seconds","pageviews"
0a1d796327284ebb443f71d85cb37db9,"vk.com",2016-01-29 22:10:52,3804,115
0a1d796327284ebb443f71d85cb37db9,"2gis.ru",2016-01-29 22:48:52,214,24
0a1d796327284ebb443f71d85cb37db9,"yandex.ru",2016-01-29 22:14:30,4,2
0a1d796327284ebb443f71d85cb37db9,"worldoftanks.ru",2016-01-29 22:10:30,41,2
I need to remove the rows whose address contains certain words. There are 117 words.
I tried:
for line in df:
    if 'yandex.ru' in line:
        df = df.replace(line, '')
But with 117 words it runs too slowly. Afterwards I create a pivot_table,
and the words I tried to remove still show up as columns:
aaa 10ruslake.ru youtube.ru 1tv.ru 24open.ru
0 0025977ab2998580d4559af34cc66a4e 0 0 34 43
1 00c651e018cbcc8fe7aa57492445c7a2 230 0 0 23
2 0120bc30e78ba5582617a9f3d6dfd8ca 12 0 0 0
3 01249e90ed8160ddae82d2190449b773 25 0 13 25
Such a column contains only 0.
How can I do this faster and remove the rows, so that these words do not appear as columns?
Answer 0: (score: 1)
IIUC you can use isin with boolean indexing:
print(df)
ID address used_at \
0 0a1d796327284ebb443f71d85cb37db9 vk.com 2016-01-29 22:10:52
1 0a1d796327284ebb443f71d85cb37db9 vk.com 2016-01-29 22:10:52
2 0a1d796327284ebb443f71d85cb37db9 2gis.ru 2016-01-29 22:48:52
3 0a1d796327284ebb443f71d85cb37db9 yandex.ru 2016-01-29 22:14:30
4 0a1d796327284ebb443f71d85cb37db9 worldoftanks.ru 2016-01-29 22:10:30
active_seconds pageviews
0 3804 115
1 3804 115
2 214 24
3 4 2
4 41 2
words = ['vk.com','yandex.ru']
print(~df.address.isin(words))
0 False
1 False
2 True
3 False
4 True
Name: address, dtype: bool
print(df[~df.address.isin(words)])
ID address used_at \
2 0a1d796327284ebb443f71d85cb37db9 2gis.ru 2016-01-29 22:48:52
4 0a1d796327284ebb443f71d85cb37db9 worldoftanks.ru 2016-01-29 22:10:30
active_seconds pageviews
2 214 24
4 41 2
Then use pivot:
print(df[~df.address.isin(words)].pivot(index='ID', columns='address', values='pageviews'))
address 2gis.ru worldoftanks.ru
ID
0a1d796327284ebb443f71d85cb37db9 24 2
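For the full list of 117 words, the same idea can be put together end to end. What follows is a minimal sketch, not a definitive implementation: the sample frame is rebuilt from the rows in the question, the names `filtered` and `pivoted` are illustrative, and `pivot_table` with `aggfunc="sum"` is an assumption made so that repeated ID/address pairs are added up instead of raising an error.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "ID": ["0a1d796327284ebb443f71d85cb37db9"] * 4,
    "address": ["vk.com", "2gis.ru", "yandex.ru", "worldoftanks.ru"],
    "used_at": ["2016-01-29 22:10:52", "2016-01-29 22:48:52",
                "2016-01-29 22:14:30", "2016-01-29 22:10:30"],
    "active_seconds": [3804, 214, 4, 41],
    "pageviews": [115, 24, 2, 2],
})

# Illustrative list of addresses to drop; the real one would hold all 117.
words = ["vk.com", "yandex.ru"]

# A single vectorised membership test replaces the per-word Python loop.
filtered = df[~df["address"].isin(words)]

# Pivot only the remaining rows, so removed addresses never become columns.
# Assumption: duplicate ID/address pairs should be summed.
pivoted = filtered.pivot_table(index="ID", columns="address",
                               values="pageviews", aggfunc="sum",
                               fill_value=0)
print(pivoted)
```

Filtering before pivoting matters here: once the unwanted rows are gone, the pivot has nothing to build those columns from.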
Another solution is to drop the rows where some column is 0 (e.g. pageviews):
print(df)
ID address used_at \
0 0a1d796327284ebb443f71d85cb37db9 youtube.ru 2016-01-29 22:10:52
1 0a1d796327284ebfsffsdf youtube.ru 2016-01-29 22:10:52
2 0a1d796327284ebb443f71d85cb37db9 vk.com 2016-01-29 22:10:52
3 0a1d796327284ebb443f71d85cb37db9 2gis.ru 2016-01-29 22:48:52
4 0a1d796327284ebb443f71d85cb37db9 yandex.ru 2016-01-29 22:14:30
5 0a1d796327284ebb443f71d85cb37db9 worldoftanks.ru 2016-01-29 22:10:30
active_seconds pageviews
0 3804 0
1 3804 0
2 3804 115
3 214 24
4 4 2
5 41 2
print(df.pageviews != 0)
0 False
1 False
2 True
3 True
4 True
5 True
Name: pageviews, dtype: bool
print(df[(df.pageviews != 0)])
ID address used_at \
2 0a1d796327284ebb443f71d85cb37db9 vk.com 2016-01-29 22:10:52
3 0a1d796327284ebb443f71d85cb37db9 2gis.ru 2016-01-29 22:48:52
4 0a1d796327284ebb443f71d85cb37db9 yandex.ru 2016-01-29 22:14:30
5 0a1d796327284ebb443f71d85cb37db9 worldoftanks.ru 2016-01-29 22:10:30
active_seconds pageviews
2 3804 115
3 214 24
4 4 2
5 41 2
print(df[(df.pageviews != 0)].pivot_table(index='ID', columns='address', values='pageviews'))
address 2gis.ru vk.com worldoftanks.ru yandex.ru
ID
0a1d796327284ebb443f71d85cb37db9 24 115 2 2
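If the pivot table already exists, the columns that contain only 0 can also be dropped from it directly. This is a rough sketch under the assumption that the table looks like the one in the question, with `aaa` holding the IDs; the names `pivot` and `nonzero` are illustrative.

```python
import pandas as pd

# Pivot table with the values shown in the question.
pivot = pd.DataFrame({
    "aaa": ["0025977ab2998580d4559af34cc66a4e",
            "00c651e018cbcc8fe7aa57492445c7a2",
            "0120bc30e78ba5582617a9f3d6dfd8ca",
            "01249e90ed8160ddae82d2190449b773"],
    "10ruslake.ru": [0, 230, 12, 25],
    "youtube.ru": [0, 0, 0, 0],
    "1tv.ru": [34, 0, 0, 13],
    "24open.ru": [43, 23, 0, 25],
}).set_index("aaa")

# Keep only the columns that have at least one non-zero value.
nonzero = pivot.loc[:, (pivot != 0).any(axis=0)]
print(nonzero)
```

Here `youtube.ru` is the only all-zero column, so it is the only one removed.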
Answer 1: (score: 0)
The fastest way I know to process a csv file is to create a DataFrame from it with the Pandas package.
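import pandas as pd
# the_path_of_your_file is a placeholder for the path to your csv file
df = pd.read_csv(the_path_of_your_file, header=0)
# .ix has been removed from pandas; .loc does the same label-based selection
df.loc[df['address'] == 'yandex.ru', 'address'] = ''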
This replaces the cells containing 'yandex.ru' with an empty string. Then you can write the result back to csv:
df.to_csv(the_path_of_your_file, index=False)  # index=False keeps the row index out of the file
If what you want is to delete the rows in which that URL occurs, use:
df = df.drop(df[df.address == 'yandex.ru'].index)
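The line above matches one exact address. If the words may instead occur inside longer addresses, a substring match over the whole list is one option. This is only a sketch: the address `maps.yandex.ru` is made up for illustration, and `words`, `pattern`, `mask` and `kept` are hypothetical names.

```python
import re

import pandas as pd

# Small illustrative frame; "maps.yandex.ru" is an invented example row.
df = pd.DataFrame({
    "address": ["vk.com", "maps.yandex.ru", "2gis.ru", "yandex.ru"],
    "pageviews": [115, 7, 24, 2],
})

# Illustrative list; the real one would hold all 117 words.
words = ["vk.com", "yandex.ru"]

# Escape each word so that "." is matched literally, then join with "|".
pattern = "|".join(re.escape(w) for w in words)

# True where the address contains any of the words as a substring.
mask = df["address"].str.contains(pattern, regex=True)

# Keep only the rows that match none of the words.
kept = df[~mask]
print(kept)
```

For exact matches, `isin` is the simpler choice; the regular expression is only needed when partial matches should be removed too.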