Question

I am trying to clean up a list of URLs that contain junk characters, as shown below.

/gradoffice/index.aspx(
/gradoffice/index.aspx -
/gradoffice/index.aspxjavascript $
/gradoffice/index.aspx~

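I have a CSV file with more than 190k distinct URL records. I loaded the CSV into a pandas DataFrame and pulled the entire URL column out as a list with the statement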

str = df['csuristem']

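That clearly gives me all the values in the column. But when I use the code below, it prints only about 40k records, and the output starts somewhere in the middle. I don't know where it goes wrong. The program runs without errors but shows me only part of the results. Any help would be much appreciated.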

import pandas
table = pandas.read_csv("SS3.csv", dtype=object)
df = pandas.DataFrame(table)
str = df['csuristem']
for s in str:
    s = s.split(".")[0]
    print s

I expect to get output like this

/gradoffice/index.
/gradoffice/index.
/gradoffice/index.
/gradoffice/index.

Thank you, Santhosh.

Answer 1

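You need to do the following: call .str.split on the column, then .str[0] to access the first part of each split string, which is the part you are interested in: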

In [6]: df['csuristem'].str.split('.').str[0]
Out[6]:
0    /gradoffice/index
1    /gradoffice/index
2    /gradoffice/index
3    /gradoffice/index
Name: csuristem, dtype: object
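
For anyone who wants to try this without the original file, here is a minimal self-contained sketch. The sample rows are made up to mirror the URLs in the question (they are not read from SS3.csv), and the None row is an assumption added to show what happens with a missing value: the vectorized .str accessor passes it through as missing, whereas calling s.split(".") on it inside a plain for loop would raise an error.

```python
import pandas as pd

# Hypothetical sample data standing in for the 'csuristem' column of SS3.csv.
df = pd.DataFrame({
    "csuristem": [
        "/gradoffice/index.aspx(",
        "/gradoffice/index.aspx-",
        "/gradoffice/index.aspxjavascript$",
        "/gradoffice/index.aspx~",
        None,  # assumed missing value, included to show how it is handled
    ]
})

# Split every value on "." and keep the first piece, for the whole column at once.
cleaned = df["csuristem"].str.split(".").str[0]

# The result has one entry per input row, so nothing is dropped.
print(len(cleaned), len(df))
print(cleaned.tolist())
```

Comparing len(cleaned) with len(df) is a quick way to confirm that every row was processed, independent of how much of the printed output the console keeps.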
