Removing or replacing URLs and RT in a Twitter dataset

Date: 2018-06-07 00:47:28

Tags: regex twitter classification rapidminer

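I am currently cleaning a Twitter dataset for text classification, and I have a question about how to replace (or possibly remove) URLs, "RT", and the @ character. I have read several forum posts on this, but I still can't make sense of them.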

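For the URLs in the dataset, I want to replace anything that starts with "https:" or "http:" with "link" (I don't know why the replacement can't simply be an empty value such as ""). But after running my process with the Replace operator in RapidMiner, the example "http://blablabla" did not become just "link"; the result was "linkblablabla" instead. Maybe this has something to do with RegEx? I know what RegEx is, but I don't know how to use or write it.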

I'm really confused right now. Please help me. Thank you.

Here is my RapidMiner process:

1 Answer:

Answer 0 (score: 0)

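The problem with your expression is that it correctly matches the "https:" part, but only that part. Because the replacement parameter tells the operator to replace the match with "link", the rest of the URL is left in place, which produces the "linkblablabla" output.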
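The behavior described above can be reproduced outside RapidMiner. The following is a minimal Python sketch, not the asker's actual process: the pattern `https?://` and the function name `replace_prefix_only` are assumptions chosen for illustration, since the original expression is not shown in the question.

```python
import re

# Assumed pattern for illustration: it matches only the scheme prefix
# ("http://" or "https://"), not the remainder of the URL.
PREFIX_ONLY = re.compile(r"https?://")


def replace_prefix_only(text: str) -> str:
    """Replace just the matched prefix with the token "link"."""
    return PREFIX_ONLY.sub("link", text)


# Only the prefix is swapped out, so the rest of the URL survives.
print(replace_prefix_only("http://blablabla"))  # linkblablabla
```

A regex replacement only touches the characters that the pattern matched, so the pattern itself has to cover the whole URL.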

If you want to replace the complete link with the replacement token "link", you need the following RegEx:

(https?|http)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]

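Then enter "link" or "" in the replacement field, depending on whether you want to replace the link with a token or remove it entirely.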
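To check the expression before putting it into the Replace operator, it can be tried in any regex engine. Here is a hedged Python sketch using the RegEx from this answer unchanged; the function name `replace_urls`, the `token` parameter, and the sample tweet are made up for the example, and RapidMiner itself uses Java regular expressions, which should treat this particular pattern the same way.

```python
import re

# The RegEx from the answer, copied as-is. The final character class omits
# punctuation such as "." and ",", so a URL at the end of a sentence does
# not swallow the trailing punctuation mark.
URL_PATTERN = re.compile(
    r"(https?|http)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]"
)


def replace_urls(text: str, token: str = "link") -> str:
    """Replace every complete URL with `token`; pass "" to remove URLs."""
    return URL_PATTERN.sub(token, text)


tweet = "RT check this http://blablabla now"  # hypothetical sample text
print(replace_urls(tweet))      # RT check this link now
print(replace_urls(tweet, ""))  # RT check this  now
```

Note that removing a URL with an empty replacement leaves the surrounding whitespace behind, so a later whitespace cleanup step may still be useful.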

You can find a more detailed explanation in the RapidMiner Community.