Question

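I'm new to programming and to regular expressions, and I'm reading Mastering Regular Expressions, but I can't find out how to get rid of tabs, line breaks, and strange non-word or non-digit characters (icons and odd non-characters, western line feeds (?)) in the text column of my tsv file. The file is UTF-8 encoded and in Swedish.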

It looks like this:

"from_user","month","full_text"
"bellaboo",4,"RT @BodilMalmsten: \"om man klarar av att föra ett bestick till munnen eller      behöver hjälp på toaletten\"
Have a heart, borgarrådet
Have a hea,RT @BodilMalmsten: Borgarrådet om riktlinjerna \"om man klarar av att föra ett   bestick till munnen eller behöver hjälp på toaletten\"
Hjälp
1   min dröm
2   allas önskningar
3   viljan att segra
H,RT @BodilMalmsten: Klarar du av att föra ett bestick till munnen eller behöver hjälp på  toaletten?
http://t.co/fcvcf0U2dW"

Can anyone help me, so that I can get on with the text analysis I actually want to do on this file?

Answer 1

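Since you tagged the question python-3.x, here is a Python 3.x answer.

I think the problem you're running into is that the CSV reader gets upset by all the newlines inside the third column. This program removes all the extra newlines and normalizes all whitespace (words end up separated by a single space).

I'm using Python's "verbose" pattern mode with comments to make it clear how the pattern matches the columns. The tricky one is the third, which can contain newlines. It simply matches anything until it sees the terminating double quote.

I'm not sure how you want to clean the strings; the pattern I give just replaces every "control character" (ASCII 0x01 to 0x1f, plus the ASCII DEL character 0x7f) with a space. The whitespace normalization then removes any extra whitespace.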

import re
import sys

# Usage: python3 cleaner.py <infile> <outfile>
_, infile, outfile = sys.argv

s_pat_row = r'''
    "([^"]+)"  # group 1: first column, a quoted string
    \s*,\s*  # match separating comma and any optional white space
    (\S+)  # group 2: second column, a run of non-space characters
    \s*,\s*  # match separating comma and any optional white space
    "((?:\\"|[^"])*)"  # group 3: string data that can include escaped quotes and newlines
'''
pat_row = re.compile(s_pat_row, re.VERBOSE)

# control characters: ASCII 0x01 to 0x1f, plus DEL (0x7f)
s_pat_clean = r'''[\x01-\x1f\x7f]'''
pat_clean = re.compile(s_pat_clean)

row_template = '"{}",{},"{}"\n'

# the file is UTF-8, so say so explicitly instead of relying on the platform default
with open(infile, "rt", encoding="utf-8") as inf, open(outfile, "wt", encoding="utf-8") as outf:
    data = inf.read()
    for m in pat_row.finditer(data):
        row = m.groups()
        # replace each control character (newlines and tabs included) with a space
        cleaned = pat_clean.sub(' ', row[2])
        # split/join collapses every run of white space into a single space
        words = cleaned.split()
        cleaned = ' '.join(words)
        outrow = row_template.format(row[0], row[1], cleaned)
        outf.write(outrow)

You can edit the pattern specified in s_pat_clean to clean out whatever characters you need removed.
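As one possible variation on the cleaning step, here is a minimal sketch that drops "icons" as well as control characters. It assumes that "icons" means Unicode characters in the "Symbol, other" category (most pictographs and emoji); that reading of the question is a guess. The function name strip_symbols_and_controls is made up for illustration, and the approach uses the standard unicodedata module instead of a regex character class.

```python
import unicodedata


def strip_symbols_and_controls(text: str) -> str:
    """Replace control/format characters and 'other symbol' characters with spaces."""
    kept = []
    for ch in text:
        category = unicodedata.category(ch)
        # Categories starting with 'C' cover control, format, private-use and
        # unassigned code points; 'So' ("Symbol, other") covers most pictographs.
        if category.startswith("C") or category == "So":
            kept.append(" ")
        else:
            kept.append(ch)
    # Same whitespace normalization as in the program above.
    return " ".join("".join(kept).split())


print(strip_symbols_and_controls("Hjälp ☺ min\tdröm\x07 http://t.co/x"))
# prints: Hjälp min dröm http://t.co/x
```

Swedish letters and ordinary punctuation are left alone, so URLs and @mentions survive. If you also want to drop punctuation, a regex such as the one in s_pat_clean is probably the simpler tool.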

To use this, save it in a file named cleaner.py, put your input in a file named data.txt, and then run:

python3 cleaner.py data.txt cleaned.txt

The result is saved in the output file cleaned.txt.

Here is the result of running it on the sample you provided:

"from_user","month","full_text"
"bellaboo",4,"RT @BodilMalmsten: \"om man klarar av att föra ett bestick till munnen eller behöver hjälp på toaletten\"Have a heart, borgarrådet Have a hea,RT @BodilMalmsten: Borgarrådet om riktlinjerna \"om man klarar av att föra ett bestick till munnen eller behöver hjälp på toaletten\" Hjälp 1 min dröm 2 allas önskningar 3 viljan att segra H,RT @BodilMalmsten: Klarar du av att föra ett bestick till munnen eller behöver hjälp på toaletten? http://t.co/fcvcf0U2dW"

Now a CSV reader should have no trouble parsing the file.
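A minimal sketch of that last step with Python's standard csv module follows. The sample string is a shortened stand-in for the contents of cleaned.txt, and read_rows is a name invented for this example. Because the data escapes quotes as \" rather than doubling them, the reader is told about the escape character.

```python
import csv
import io

# Shortened stand-in for the contents of cleaned.txt (assumed sample).
cleaned = (
    '"from_user","month","full_text"\n'
    '"bellaboo",4,"RT @BodilMalmsten: \\"om man klarar av\\" Hjälp"\n'
)


def read_rows(text):
    # Quotes inside the text column are written as \" instead of "",
    # so set the escape character and turn off quote doubling.
    reader = csv.reader(io.StringIO(text), escapechar="\\", doublequote=False)
    return list(reader)


rows = read_rows(cleaned)
for row in rows:
    print(row)
```

For a real file you would pass a file object instead of io.StringIO, opened with encoding="utf-8" and newline="" as the csv documentation recommends.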

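Edit: I re-ran the program with the correct input and replaced the output example with the result from that run. As you can see above, accented characters in the input are passed through correctly.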

Answer 2

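If you want to remove everything except "regular" (English) "word" characters, you can do this (PHP example, since you didn't specify a language; the pattern itself is [^\w ] or, if your language doesn't support shorthand character classes, [^a-zA-Z0-9_ ]):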

$string = preg_replace('~[^\w ]~','',$string);

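If you want to use UTF-8 mode, since you mentioned Swedish (it isn't very clear whether you want to remove or keep those characters), you can use the u modifier: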

$string = preg_replace('~[^\w ]~u','',$string);

Again, this is a PHP example; you didn't say which language you're using.

In other words, the actual regex pattern would be

[^\w ]

or

[^a-zA-Z0-9_ ]

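If you want to keep the Swedish characters, you need to enable UTF-8 support in whatever language or environment you're using.

Edit: I also threw the regular space into the character class, since you probably want to keep that too!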
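Since the question was tagged python-3.x, here is a small sketch of the same pattern in Python. In Python 3, \w is Unicode-aware by default for str patterns, so no extra modifier is needed to keep å, ä and ö; the re.ASCII flag gives the English-only behaviour of [^a-zA-Z0-9_ ]. The sample text is invented for illustration.

```python
import re

text = "Hjälp! 1 min dröm ☺ @BodilMalmsten"

# Unicode-aware (the default for str patterns): Swedish letters count as word characters.
keep_unicode = re.sub(r"[^\w ]", "", text)

# ASCII-only, equivalent to [^a-zA-Z0-9_ ]: Swedish letters are removed as well.
keep_ascii = re.sub(r"[^\w ]", "", text, flags=re.ASCII)

print(keep_unicode)
print(keep_ascii)
```

Note that removing a character that sits between two spaces leaves a double space behind, which is why a later step that collapses runs of spaces is useful.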

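Edit 2: Actually, the newline characters are a whole other can of worms. Because they spread your data across several lines, they will trip up any built-in function you use to read the (csv) file. What you can do to correct this is, before doing the above, read in the whole file and replace every match of \r?\n(?=[^"]) with "" (the empty string). The PHP version would be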

$string = preg_replace('~\r?\n(?=[^"])~','',$string);

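The idea is to strip every newline except those followed by a quote, on the assumption that your columns are wrapped in quotes, so that the real row breaks in the file are preserved.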
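A rough Python sketch of the same idea is shown here. The function name join_broken_rows and the replacement parameter are additions made for illustration; the default of "" mirrors the PHP call, while passing " " avoids gluing together the words on either side of a removed newline.

```python
import re

# A newline (optionally preceded by \r) followed by any character other than a double quote.
_INNER_NEWLINE = re.compile(r'\r?\n(?=[^"])')


def join_broken_rows(text: str, replacement: str = "") -> str:
    """Remove newlines that fall inside a quoted column, keeping real row breaks."""
    return _INNER_NEWLINE.sub(replacement, text)


sample = '"a",1,"first\nsecond"\n"b",2,"third"\n'
print(join_broken_rows(sample), end="")
print(join_broken_rows(sample, " "), end="")
```

The same caveat applies as in the PHP version: a line inside the text column that happens to start with a double quote would be mistaken for the start of a new row.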

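Edit 3: This is how I would do it in PHP. I wish I knew enough Python to give you a Python version, but I don't, so maybe you can figure out how to run the PHP version (it's really not that hard...) or get someone to translate it for you.

First run this script: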

<?php
/* 
 STEP 1:
 run this on original data initially, to strip all newlines, except for the 
 ones that are supposed to be there to start a new row 
*/
// get the data from the original data file
$file = file_get_contents('data.csv');
// strip out newline chars that are not followed by a quote
$file = preg_replace('~\r?\n(?=[^"])~','',$file);
// write the data to a new file to preserve original data
file_put_contents('data2.csv',$file);
?>

Then run this script:

<?php
/*
 STEP 2:
 run this to strip out non-word chars and extra spaces, preserving swedish chars
*/
// set php to parse using Swedish settings (e.g. utf-8)
setlocale (LC_ALL, "Swedish");
// open the new file that's scrubbed of bad newlines
$handle = fopen("data2.csv", "r");
// also let's open another file to put in final scrubbed data in
$handle2 = fopen("data3.csv", "w");
// for each row in the file.. (fgetcsv puts the columns into an array $data)
// a length of 0 means no limit, since the joined rows can be longer than 1000 chars
while (($data = fgetcsv($handle, 0, ",")) !== FALSE) {
  // for each column in the current row...
  array_walk($data,function(&$value) {
    // first let's strip all non-word chars except spaces; the u modifier
    // treats the data as utf-8 so that swedish letters count as word chars
    $value = preg_replace('~[^\w ]~u','',$value);
    // then let's consolidate multiple spaces into a single space
    $value = preg_replace('~ +~',' ',$value);
  });
  // now let's write the scrubbed row to the new file. we're going to use fwrite
  // instead of fputcsv because fputcsv will not always wrap the columns
  // in quotes. So we're going to ensure that each column has quote wrappers, 
  // same as original. This isn't a problem for most csv parsers but just in 
  // case you're rolling your own.. 
  fwrite($handle2,'"'.implode('","',$data).'"'.PHP_EOL);
}
// finally, let's close the files. 'data3.csv' contains the final scrubbed data
fclose($handle);
fclose($handle2);
?>
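For readers who would rather stay in Python, here is a hedged translation of step 2 using the standard csv module. It is a sketch, not the answer author's code: scrub_field and scrub_csv are invented names, and the in-memory io.StringIO objects stand in for data2.csv and data3.csv. Like the PHP version, it wraps every column in quotes on output.

```python
import csv
import io
import re

NON_WORD = re.compile(r"[^\w ]")   # Unicode-aware in Python 3, so å, ä, ö are kept
MULTI_SPACE = re.compile(r" +")


def scrub_field(value: str) -> str:
    # Strip everything that is not a word character or a space,
    # then collapse runs of spaces into a single space.
    value = NON_WORD.sub("", value)
    return MULTI_SPACE.sub(" ", value)


def scrub_csv(src, dst) -> None:
    # The input escapes quotes as \" rather than doubling them.
    reader = csv.reader(src, escapechar="\\", doublequote=False)
    # QUOTE_ALL wraps every column in quotes, as the fwrite/implode line does.
    writer = csv.writer(dst, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for row in reader:
        writer.writerow([scrub_field(v) for v in row])


src = io.StringIO('"bellaboo",4,"RT @BodilMalmsten: \\"om man   klarar av\\" Hjälp"\n')
dst = io.StringIO()
scrub_csv(src, dst)
print(dst.getvalue(), end="")
```

With real files, open both with encoding="utf-8" and newline="" and pass the file objects to scrub_csv.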
