I have some text files generated by commercial software, like the one below. Each file consists of sections delimited by parentheses. Every section contains several million elements, but the exact number varies from one case to another.
(1
2
3
...
)
(11
22
33
...
)
(111
222
333
...
)
I need to produce output like this:
1; 11; 111
2; 22; 222
3; 33; 333
... ... ...
I found a complicated way to do it:
Perform sed operations to get
1
2
3
...
#
11
22
33
...
#
111
222
333
...
Use awk as follows to split my file into several sub-files:
awk -v RS="#" '{print > ("splitted-" NR ".txt")}'
Remove the blank lines from my sub-files with sed again:
sed -i '/^[[:space:]]*$/d' splitted*.txt
Put everything together:
paste splitted*.txt > out.txt
Add a field separator (defined in my bash script):
awk -v sep="$my_sep" 'BEGIN{OFS=sep}{$1=$1; print }' out.txt > formatted.txt
This feels clumsy, because I loop over millions of lines several times. Even though the run time is acceptable (~80 s), I would like to find a pure awk solution, but I cannot work it out.
I found some related questions, especially "row to column conversion with awk", but it assumes a constant number of lines between the brackets, which I cannot guarantee.
Any help would be appreciated.
Answer 0 (score: 5)
With GNU awk, for its multi-character RS and true multi-dimensional arrays:
$ cat tst.awk
BEGIN {
RS = "(\\s*[()]\\s*)+"
OFS = ";"
}
NR>1 {
cell[NR][1]
split($0,cell[NR])
}
END {
for (rowNr=1; rowNr<=NF; rowNr++) {
for (colNr=2; colNr<=NR; colNr++) {
printf "%6s%s", cell[colNr][rowNr], (colNr<NR ? OFS : ORS)
}
}
}
$ awk -f tst.awk file
1; 11; 111
2; 22; 222
3; 33; 333
...; ...; ...
Answer 1 (score: 4)
If you know there are 3 columns, you can do it as follows:
pr -3ts <file>
All that remains is to get rid of the brackets:
$ pr -3ts ~/tmp/f | awk 'BEGIN{OFS="; "}{gsub(/[()]/,"")}(NF){$1=$1; print}'
1; 11; 111
2; 22; 222
3; 33; 333
...; ...; ...
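You can also do this in a single awk line, but that only complicates things. The approach above is simple and quick.
This awk program is the fully generic version: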
awk 'BEGIN{r=c=0}
/\)/{r=0; c++; next}
{gsub(/[( ]/,"")}
(NF){a[r++,c]=$1; rm=rm>r?rm:r}
END{ for(i=0;i<rm;++i) {
printf "%s", a[i,0];
for(j=1;j<c;++j) printf "; %s", a[i,j];
print ""
}
}' <file>
Answer 2 (score: 3)
Assuming your actual Input_file looks like the samples shown, please try the following.
awk -v RS="" '
{
gsub(/\n|, /,",")
}
1' Input_file |
awk '
{
while(match($0,/\([^\)]*/)){
value=substr($0,RSTART+1,RLENGTH-2)
$0=substr($0,RSTART+RLENGTH)
num=split(value,array,",")
for(i=1;i<=num;i++){
val[i]=val[i]?val[i] OFS array[i]:array[i]
}
}
for(j=1;j<=num;j++){
print val[j]
}
delete val
delete array
value=""
}' OFS="; "
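OR (the script above assumes that the number of values inside (...) is constant; the script added below works even when the number of fields inside (...) differs between sections):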
awk -v RS="" '
{
gsub(/\n/,",")
gsub(/, /,",")
}
1' Input_file |
awk '
{
while(match($0,/\([^\)]*/)){
value=substr($0,RSTART+1,RLENGTH-2)
$0=substr($0,RSTART+RLENGTH)
num=split(value,array,",")
for(i=1;i<=num;i++){
val[i]=val[i]?val[i] OFS array[i]:array[i]
max=num>max?num:max
}
}
for(j=1;j<=max;j++){
print val[j]
}
delete val
delete array
}' OFS="; "
The output is as follows.
1; 11; 111
2; 22; 222
3; 33; 333
Explanation: an annotated version of the code above.
awk -v RS="" ' ##Setting RS(record separator) as NULL here.
{ ##Starting BLOCK here.
gsub(/\n/,",") ##Replacing every newline with a comma here.
gsub(/, /,",") ##Replacing every comma followed by a space with a plain comma here.
}
1' Input_file | ##Mentioning 1 will be printing edited/non-edited line of Input_file. Using | means sending this output as Input to next awk program.
awk ' ##Starting another awk program here.
{
while(match($0,/\([^\)]*/)){ ##Using while loop which will run till a match is FOUND for (...) in lines.
value=substr($0,RSTART+1,RLENGTH-2) ##Storing RLENGTH-2 characters starting at RSTART+1 (the match without its leading "(" and its last character) in variable value here.
$0=substr($0,RSTART+RLENGTH) ##Re-creating current line with substring value from RSTART+RLENGTH till last of line.
num=split(value,array,",") ##Splitting value variable into array named array whose delimiter is comma here.
for(i=1;i<=num;i++){ ##Using for loop which runs from i=1 to till value of num(length of array).
val[i]=val[i]?val[i] OFS array[i]:array[i] ##Creating array val whose index is value of variable i and concatenating its own values.
}
}
for(j=1;j<=num;j++){ ##Starting a for loop from j=1 to till value of num here.
print val[j] ##Printing value of val whose index is j here.
}
delete val ##Deleting val here.
delete array ##Deleting array here.
value="" ##Nullifying variable value here.
}' OFS="; " ##Making OFS value as ; with space here.
Note: this also works for more than 3 values inside the (...) brackets.
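When sections differ in length, the second script above appends each value to its row as it arrives, so a value from a later section can slide into an earlier column. The following is a minimal sketch of one way to keep the columns aligned, assuming one element per line and that empty placeholders are acceptable for missing cells. It uses only POSIX awk features; the function name transpose_sections is hypothetical and not part of the answer.

```shell
# transpose_sections: hypothetical helper, not from the original answer.
# Reads parenthesised sections on stdin (one element per line) and prints
# one row per element index, leaving empty fields for short sections.
transpose_sections() {
    awk -v OFS="; " '
        { gsub(/[ \t\r]/, "") }                     # drop blanks and carriage returns
        /^\(/ { ncol++; nrow = 0; sub(/^\(/, "") }  # "(" opens a new section (one output column)
        /\)/  { sub(/\).*$/, "") }                  # ")" closes the section, so drop it
        $0 != "" {
            cell[++nrow, ncol] = $0                 # store by (row, column)
            if (nrow > maxrow) maxrow = nrow
        }
        END {
            for (r = 1; r <= maxrow; r++) {
                line = cell[r, 1]
                for (c = 2; c <= ncol; c++) line = line OFS cell[r, c]
                print line
            }
        }
    '
}
```

It can be called as transpose_sections < Input_file. The whole table is held in memory, which is the same trade-off the scripts above make.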
Answer 3 (score: 2)
awk 'BEGIN { RS = "\\s*[()]\\s*"; FS = "\\s*" }
NF > 0 {
maxCol++
if (NF > maxRow)
maxRow = NF
for (row = 1; row <= NF; row++)
a[row,maxCol] = $row
}
END {
for (row = 1; row <= maxRow; row++) {
for (col = 1; col <= maxCol; col++)
printf "%s", a[row,col] ";"
print ""
}
}' yourFile
Output
1;11;111;
2;22;222;
3;33;333;
...;...;...;
If you also want to preserve whitespace inside fields, change FS = "\\s*" to FS = "\n*".
This script supports columns of different lengths.
When benchmarking, also consider replacing [i,j] with GNU awk's [i][j]. I am not sure which one is faster, and I have not benchmarked the scripts myself.
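To benchmark the variants, an input of realistic size is needed. The following is a minimal sketch of a generator, assuming the same layout as the question (one element per line, sections wrapped in parentheses). The function name make_sample and the values it prints are hypothetical test data.

```shell
# make_sample: hypothetical helper that prints synthetic benchmark input.
#   $1 = number of elements per section
#   $2 = number of sections
make_sample() {
    awk -v n="$1" -v k="$2" 'BEGIN {
        for (s = 1; s <= k; s++) {
            for (i = 1; i <= n; i++)
                # the first element of each section carries the opening "("
                printf "%s%d\n", (i == 1 ? "(" : ""), i * s
            print ")"
        }
    }'
}
```

For example, make_sample 1000000 3 > sample.txt writes three sections of one million elements each. Each candidate script can then be run on sample.txt under the shell's time keyword to compare the [i,j] and [i][j] versions.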
Answer 4 (score: 1)
Here is a Perl one-liner solution:
$ cat edouard2.txt
(1
2
3
a
)
(11
22
33
b
)
(111
222
333
c
)
$ perl -lne ' $x=0 if s/[)(]// ; if(/(\S+)/) { @t=@{$val[$x]};push(@t,$1);$val[$x++]=[@t] } END { print join(";",@{$val[$_]}) for(0..$#val) }' edouard2.txt
1;11;111
2;22;222
3;33;333
a;b;c
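Answer 5 (score: 0)
I would convert each section to a single line and then transpose the result, for example with datamash, assuming you are using GNU awk: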
<infile awk '{ gsub("[( )]", ""); $1=$1 } 1' RS='\\)\n\\(' OFS=';' |
datamash -t';' transpose
Output:
1;11;111
2;22;222
3;33;333
...;...;...
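If datamash is not installed, the transpose step can be approximated in plain awk. This is a minimal sketch, assuming ;-separated rows like the ones the first stage of the pipeline emits and enough memory to hold the whole table. The function name transpose_rows is hypothetical.

```shell
# transpose_rows: hypothetical stand-in for "datamash -t';' transpose".
# Reads ;-separated rows on stdin and prints them with rows and columns swapped.
transpose_rows() {
    awk -F';' '
        {
            for (i = 1; i <= NF; i++) cell[NR, i] = $i   # store by (input row, field)
            if (NF > maxnf) maxnf = NF
        }
        END {
            for (i = 1; i <= maxnf; i++) {               # each input field becomes an output row
                line = cell[1, i]
                for (r = 2; r <= NR; r++) line = line ";" cell[r, i]
                print line
            }
        }
    '
}
```

It would take the place of the datamash call at the end of the pipeline. Rows with fewer fields than the longest row produce empty cells in the output.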