Question

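I am using CSV and DataFrames to read data from Twitter analytics.

I want to extract the URL from a particular cell.

The output of the process looks like this: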

tweet number tweet id               tweet link              tweet text
1            1.0086341313026E+018   "tweet link goes here"  tweet text goes here https://example.com"

How can I slice this "tweet text" value to get its URL? I can't slice it with [-1:-12], because the tweets vary widely in character count.

Answer 1

I believe you want:

print (df['tweet text'].str[-12:-1])
0    example.com
Name: tweet text, dtype: object
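As a small sketch of why the slice bounds matter: with a positive step, the start index has to come before the stop index, so [-1:-12] selects nothing while [-12:-1] selects the 11 characters before the last one. The one-row frame below is a hypothetical stand-in for the sample data, including its trailing quote character.

```python
import pandas as pd

# Hypothetical one-row frame; the trailing quote mirrors the sample output.
df = pd.DataFrame({"tweet text": ['tweet text goes here https://example.com"']})

# Start (-1) lies after stop (-12), so with the default step this is empty.
empty = df["tweet text"].str[-1:-12]

# From the 12th-from-last character up to, but not including, the last one.
domain = df["tweet text"].str[-12:-1]

print(repr(empty.iloc[0]))
print(domain.iloc[0])
```

This only works while every URL has the same length, which is the limitation the question describes.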

A more general solution is to use a regex with str.findall to get a list of all links, then select the first one by indexing with str[0] if needed:

pat = r'(?:http|ftp|https)://(?:[\w_-]+(?:(?:\.[\w_-]+)+))(?:[\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?'

print (df['tweet text'].str.findall(pat).str[0])
0    https://example.com
Name: tweet text, dtype: object
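If only the first link per tweet is needed, str.extract may be a shorter route. The sketch below uses a deliberately simplified pattern (an assumption, much looser than the one above) and a made-up two-row frame; rows without a match come back as NaN.

```python
import pandas as pd

# Hypothetical sample data: one tweet with a link, one without.
df = pd.DataFrame({"tweet text": [
    "tweet text goes here https://example.com",
    "no link in this one",
]})

# Simplified pattern (assumption): scheme, then everything up to whitespace.
simple_pat = r"(https?://\S+)"

# With one capture group and expand=False the result is a Series;
# rows where the pattern does not match are NaN.
first_url = df["tweet text"].str.extract(simple_pat, expand=False)

print(first_url)
```

Because the pattern stops only at whitespace, trailing punctuation such as a closing quote would be captured as part of the URL.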

Answer 2

Here is one way to find URLs, using a list of test strings and pd.Series.apply:

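import pandas as pd

s = pd.Series(['tweet text goes here https://example.com',
               'some http://other.com example',
               'www.thirdexample.com is here'])

# Substrings that mark a whitespace-separated token as a URL
test_strings = ['http', 'www']

def url_finder(x):
    # Return the first token containing any test string,
    # or None when the text has no such token
    return next((i for i in x.split() if any(t in i for t in test_strings)), None)

res = s.apply(url_finder)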

print(res)

0     https://example.com
1        http://other.com
2    www.thirdexample.com
dtype: object
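One caveat with a substring test is that a word such as "httpd" also contains "http". A stricter variant, sketched here under the assumption that URLs always begin with one of a few known prefixes, checks the start of each token instead. The names prefixes and first_url are illustrative only.

```python
import pandas as pd

s = pd.Series(["restart httpd then see https://example.com",
               "www.thirdexample.com is here",
               "no link here"])

# Assumed set of prefixes that a URL token may start with.
prefixes = ("http://", "https://", "www.")

def first_url(text):
    # First whitespace-separated token starting with a known prefix, else None.
    return next((tok for tok in text.split() if tok.startswith(prefixes)), None)

res = s.apply(first_url)

print(res)
```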

Answer 3

If the domain length is variable rather than always 11 characters, here is an alternative:

In [2]: df['tweet text'].str.split('//').str[-1]

Out[2]:
1    example.com
Name: tweet text, dtype: object
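Note that splitting on '//' keeps everything after the scheme, so a URL with a path would return more than the domain. As a rough sketch, assuming the URL is always the last word of the tweet (as in the sample), the standard library's urlparse can isolate the host part. The sample row is made up for illustration.

```python
import pandas as pd
from urllib.parse import urlparse

# Hypothetical tweet whose link carries a path and a query string.
df = pd.DataFrame({"tweet text": [
    "tweet text goes here https://example.com/path?x=1",
]})

# The split approach returns everything after the last '//'.
after_scheme = df["tweet text"].str.split("//").str[-1]

# Assumption: the URL is the last whitespace-separated word.
last_word = df["tweet text"].str.split().str[-1]

# netloc is the host portion of the parsed URL.
hosts = last_word.map(lambda u: urlparse(u).netloc)

print(after_scheme.iloc[0])
print(hosts.iloc[0])
```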

Slicing a DataFrame cell after reading a CSV
