Question

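I'm working on a small sentiment-analysis project using Twitter data. I have a sample CSV file containing the data, but before doing the sentiment analysis I have to clean the data, and I'm stuck on one part. Here is the code: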

tweets['source'][2]   ## Source is an attribute in csv file containing values
Out[51]: u'<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>'

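I want to clean the source field. I don't want the values to show up with the web link and HTML tags.

Here is the code I use to clean the source: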

tweets['source_new'] = ''

for i in range(len(tweets['source'])):
    m = re.search('(?)(.*)', tweets['source'][i])
    try:
        tweets['source_new'][i]=m.group(0)
    except AttributeError:
        tweets['source_new'][i]=tweets['source'][i]

tweets['source_new'] = tweets['source_new'].str.replace('', ' ', case=False)

But when I run the code, I get this error:

Traceback (most recent call last):

  File "<ipython-input-50-f92a7f05ad1d>", line 2, in <module>
    m = re.search('(?)(.*)', tweets['source'][i])

  File "C:\Users\aneeq\Anaconda2\lib\re.py", line 146, in search
    return _compile(pattern, flags).search(string)

  File "C:\Users\aneeq\Anaconda2\lib\re.py", line 251, in _compile
    raise error, v # invalid expression

error: unexpected end of pattern

I get the error 'error: unexpected end of pattern'. Can you help me? I can't find the problem in my code.

Answer 1

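I should start by saying that using regular expressions for this task is not a good idea ¹ ²

That said, I see two ways to achieve this, depending on your context:

If you really don't know which tags you will encounter

We can extract the text value of the HTML by doing the following: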

# Replace any HTML tag with an empty string (run inside the loop over i)
value = re.sub(r'<[^>]*>', '', tweets['source'][i])
tweets.loc[i, 'source_new'] = value  # assign to row i only, not the whole column
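
As a minimal, self-contained sketch of the same idea, the snippet below strips every tag-like span from a couple of sample strings. The names `TAG_RE`, `strip_tags`, and `sources` are hypothetical and used only for illustration; the first sample value is the one from the question.

```python
import re

# Matches any tag-like span: '<', then everything up to the next '>'.
TAG_RE = re.compile(r'<[^>]*>')


def strip_tags(text):
    """Return text with every <...> span removed."""
    return TAG_RE.sub('', text)


# Sample values; the first one is the 'source' value shown in the question.
sources = [
    '<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>',
    'Plain text without tags',
]

cleaned = [strip_tags(s) for s in sources]
print(cleaned)
```

With a pandas column of strings, the same function could presumably be applied in one step with `tweets['source'].map(strip_tags)` instead of an explicit loop.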

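If you know which tags you will encounter (recommended)

This is the approach I would recommend (if you really need to use regular expressions), because it is more explicit and less prone to surprises.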

# Replace any HTML "a" tag with an empty string (run inside the loop over i)
value = re.sub(r'(?i)</?a[^>]*>', '', tweets['source'][i])
tweets.loc[i, 'source_new'] = value  # assign to row i only, not the whole column
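
A small sketch of the tag-specific variant, runnable on its own. The names `A_TAG_RE` and `strip_anchor_tags` are hypothetical. One addition that is not in the pattern above: a `\b` after the `a`, an assumption on my part, so that other tags starting with "a" (such as `<abbr>`) are left alone.

```python
import re

# Matches opening and closing "a" tags only, in any letter case.
# The \b is an addition to the pattern above: it keeps tags such as
# <abbr> or <article> from being treated as anchors.
A_TAG_RE = re.compile(r'</?a\b[^>]*>', re.IGNORECASE)


def strip_anchor_tags(text):
    """Remove <a ...> and </a> tags but keep all other markup."""
    return A_TAG_RE.sub('', text)


sample = '<A HREF="http://twitter.com" rel="nofollow">Twitter <b>Web</b> Client</A>'
result = strip_anchor_tags(sample)
print(result)
```

Because only anchor tags are targeted, any other markup in the value survives, which makes unexpected input easier to spot than with the catch-all pattern.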

Alternatively, you can look at How to remove HTML tags from a String on Python for other options and approaches.
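
One such regex-free option, sketched here under the assumption that Python 3 is available (the question's traceback shows Python 2, where the module layout differs), is the standard library's `html.parser`. The class `TextExtractor` and the function `html_to_text` are hypothetical names for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the text nodes of the markup it is fed."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for text between tags; the tags themselves are skipped.
        self.parts.append(data)


def html_to_text(markup):
    """Return the text content of an HTML fragment, without tags."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return ''.join(parser.parts)


source = '<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>'
text = html_to_text(source)
print(text)
```

Unlike the regex versions, a parser also handles attribute values that contain a `>` character, at the cost of a little more code.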

¹ Using a Regex to remove HTML tags from a string

² Using Regex to parse HTML

