Question

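I want to preprocess a Weka ARFF file containing 2000 lines for an NLP project (sentiment analysis).

I am looking for code that simply adds a quotation mark at the beginning and end of each sentence. For example, here is a sample of my dataset: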

The Da Vinci Code is one of the most beautiful movies ive ever seen.,1
The Da Vinci Code is an * amazing * book, do not get me wrong.,1
then I turn on the light and the radio and enjoy my Da Vinci Code.,1
The Da Vinci Code was REALLY good.,1
i love da vinci code....,1

I would like the output to be:

'The Da Vinci Code is one of the most beautiful movies ive ever seen.',1
'The Da Vinci Code is an * amazing * book, do not get me wrong.',1
'then I turn on the light and the radio and enjoy my Da Vinci Code.',1
'The Da Vinci Code was REALLY good.',1
'i love da vinci code....',1

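I only want to add a quotation mark at the beginning and end of each sentence (before the 1).

I would be very grateful for any help.

Is there a tool I can use instead of writing code?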

Answer 1

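You can use regular expressions for this. Regular expressions are a powerful formalism for pattern matching in strings. Many existing tools support regular expressions, which let you match and replace the text you need without writing any code yourself.

To match and replace with a regular expression (regexp), you need two parts:

Match: an expression that matches the string, or some part of it.
Replacement/substitution: an expression that describes what the match should be replaced with.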

Match:

/([^\.]+)(\.+)(,1\s+)/g

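Group 1: matches any character except a literal dot, at least 1 character.
Group 2: matches only literal dots, at least 1 character.
Group 3: matches a literal comma, followed by a literal 1, followed by at least 1 whitespace character.
Regex flag g (global): match more than once.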
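To make the three groups concrete, here is a minimal Python sketch that runs the pattern above on one sample line and prints what each group captures. Python is only an illustration here (the answer itself is tool-agnostic), and the names `PATTERN` and `sample` are made up for this example.

```python
import re

# Same pattern as above; Python needs no /.../g delimiters.
PATTERN = re.compile(r"([^\.]+)(\.+)(,1\s+)")

# One sample row, including its trailing newline (needed by \s+).
sample = "i love da vinci code....,1\n"

match = PATTERN.search(sample)
assert match is not None

sentence, dots, label = match.groups()
print(repr(sentence))  # group 1: 'i love da vinci code'
print(repr(dots))      # group 2: '....'
print(repr(label))     # group 3: ',1\n'
```

Note that group 3 only matches because the sample still ends with a newline; a line with nothing after the 1 would not match.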

Replacement:

'$1$2'$3

This wraps groups 1 and 2 in quotes, followed by group 3.
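As a rough sketch of the same match and replacement in Python, assuming the whole file content is held in one string so that every row still ends with a newline: Python's `re.sub` writes group references as `\1` rather than `$1`, and the helper name `quote_sentences` is hypothetical.

```python
import re

# Match pattern from above and the equivalent Python replacement string.
PATTERN = re.compile(r"([^\.]+)(\.+)(,1\s+)")
REPLACEMENT = r"'\1\2'\3"


def quote_sentences(text: str) -> str:
    """Wrap each 'sentence + dots' in single quotes, keeping the ',1' label."""
    return PATTERN.sub(REPLACEMENT, text)


data = (
    "The Da Vinci Code was REALLY good.,1\n"
    "i love da vinci code....,1\n"
)
print(quote_sentences(data), end="")
```

`re.sub` replaces every match by default, which corresponds to the g flag in the pattern above.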

You can see the match and replacement above in action here.

Now you can apply that match and replacement with your favorite regex tool.

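For example, with sed. Note that sed processes one line at a time with the trailing newline removed, so the final \s+ is relaxed to \s* here; otherwise lines with nothing after the 1 would not match:

sed -i -E "s/([^\.]+)(\.+)(,1\s*)/'\1\2'\3/g" yourfile.txt

Or with Windows PowerShell, where Get-Content likewise returns each line without its line ending:

(Get-Content yourfile.txt) -replace '([^\.]+)(\.+)(,1\s*)', '''$1$2''$3' | Out-File output.txt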

Other tools may use a different syntax. The match and replacement patterns provided here can probably be optimized further.
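One possible refinement, offered only as a sketch: the pattern above assumes every sentence ends in a dot and every label is 1. If that does not hold for the full file, splitting each row at its last comma avoids both assumptions. The helper name `quote_line` is hypothetical, and the backslash escaping of embedded quotes is an assumption about what Weka accepts that should be checked against the ARFF documentation.

```python
def quote_line(line: str) -> str:
    """Quote everything before the last comma of one ARFF data row."""
    row = line.rstrip("\r\n")
    # Leave blank lines, ARFF header lines (@...) and comments (%...) alone.
    if not row.strip() or row.lstrip().startswith(("@", "%")):
        return row
    text, sep, label = row.rpartition(",")
    if not sep:  # no comma at all: nothing to split, keep the row as-is
        return row
    # Assumption: backslash-escape backslashes and embedded single quotes.
    escaped = text.replace("\\", "\\\\").replace("'", "\\'")
    return f"'{escaped}'{sep}{label}"


rows = [
    "@data",
    "The Da Vinci Code is an * amazing * book, do not get me wrong.,1",
    "I didn't like it,0",
]
for row in rows:
    print(quote_line(row))
```

This sketch is not idempotent: running it on rows that are already quoted would quote them a second time.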
