linux awk简明笔记

bianhuakairi

2010-11-12

关注关注

awk是一个简单强大的文本过滤报告工具。

简介

其基本结构为pattern{action}

awk是一个面向行的编程工具。pattern定义了怎么匹配一个行，action定义了匹配成功的动作。默认为所有的行都匹配，BEGIN和END关键字，一个在所有行处理之前执行，一个在所有行处理之后执行。

awk把每一行看作一些field，$0代表整行，$1代表第一个field，等等。

和很多脚本语言不一样的是，$number不代表一个变量，而是一个field。

也不对""之中的符号进行特别解释。

换行不能乱用，可以

{action1}

{action2}

awk中有3种常量，字符串，数字，正则表达式。

数字和字符串的转换：

字符串转数字，贪婪算法转该字符串的顶头子字符串，如果不行为0。

数字转字符串，CONVFMT控制。整数不受CONVFMT控制。

awk的底层类型有两种，数字，字符串，一个变量可以既是数字，又是字符串，底层保有2个值。原因嘛，效率，精度。

运算符

比较的时候，使用的是最后赋值的类型，而不是最后使用的类型。

对于不确定类型（如传入参数，域）等，能数字比较的数字比较，不能的话变为字符串比较。

和c一脉相传的东西（很相似）

运算符

二元+-*/%<space><space>为字符串连接操作符。

一元+(正号)-(负号)

自增运算符++--

+=-=*=/=%=

=(赋值运算符)()可以明确运算符的优先级。

逻辑符

==!=>>=<<=

&&||!

检查一个字符串是否匹配

!~/正则表达式/不匹配正则表达式

~/正则表达式/匹配正则表达式

命令

if ( conditional ) statement [ else statement ]
while ( conditional ) statement
for ( expression ; conditional ; expression ) statement
for ( variable in array ) statement
break
continue
{ [ statement ] ...}
variable=expression
print [ expression-list ] [ > expression ]
printf format [ , expression-list ] [ > expression ]
next  这个挺好用的，跳出当前处理的行。
exit   这个和一般的exit不同，跳过所有的行，但是会执行END。

内置常量

FS-TheInputFieldSeparatorVariable

用-F可以指定分隔符。FS是buildin的分隔符变量。

经典例子

行为OneTwo:Three:4Five

{
	print $2
	FS=":"
	print $2
}

输出Two:Three:4Five两次。

防止side-effect，FS的设置在readline之后有效，在一次readline之中变量不会改变，可以消除很多bug。

OFS-TheOutputFieldSeparatorVariable

NF-TheNumberofFieldsVariable

NR-TheNumberofRecordsVariable

RS-TheRecordSeparatorVariable

ORS-TheOutputRecordSeparatorVariable

FILENAME-TheCurrentFilenameVariable

AssociativeArrays

Insteadofusinganumbertofindanentryinanarray,useanythingyouwant.

关联数组的key全部都是string。

直接引用数组中不存在的key会创建一个pair。

用keyinarray可以测试该数组是否有一个key的pair并且不创建pair。

delete：删除元素。

BEGIN {
	username[""]=0;
}
{
	username[$3]++;
}
END {
	for (i in username) {
		if (i != "") {
			print username[i], i;
		}
	}
}

Multi-dimensionalArrays

a[1,2]=y;这个不行。

a[1","2]=y;记住，<space>是字符串连接。

ExampleofusingAWK'sAssociativeArrays

1记住初始化一个值。

2好的变量命名。

3确保输入正确。

4给每一个field起一个好名字。

5一遍遍历，得到所有东西。

一个好例子，值得好好研究，用户，组，文件所占空间大小检测程序。

#!/bin/sh
find . -type f -print | xargs /usr/bin/ls -islg | 
awk '
BEGIN {
# initialize all arrays used in for loop
	u_count[""]=0;
	g_count[""]=0;
	ug_count[""]=0;
	all_count[""]=0;
}
{
# validate your input
        if (NF != 11) {
#	ignore
	} else {
# assign field names
		inode=$1;
		size=$2;
		linkcount=$4;
		user=$5;
		group=$6; 

# should I count this file?

		doit=0;
		if (linkcount == 1) {
# only one copy - count it
			doit++;
		} else {
# a hard link - only count first one
			seen[inode]++;
			if (seen[inode] == 1) {
				doit++;
			}
		}
# if doit is true, then count the file
		if (doit ) {

# total up counts in one pass
# use description array names
# use array index that unifies the arrays

# first the counts for the number of files

			u_count[user " *"]++;
			g_count["* " group]++;
			ug_count[user " " group]++;
			all_count["* *"]++;

# then the total disk space used

			u_size[user " *"]+=size;
			g_size["* " group]+=size;
			ug_size[user " " group]+=size;
			all_size["* *"]+=size;
		}
        }
}
END {
# output in a form that can be sorted
        for (i in u_count) {
				if (i != "") {
        	        print u_size[i], u_count[i], i;
				}
        }
        for (i in g_count) {
				if (i != "") {
        	        print g_size[i], g_count[i], i;
				}
        }
        for (i in ug_count) {
				if (i != "") {
        	        print ug_size[i], ug_count[i], i;
				}
        }
        for (i in all_count) {
				if (i != "") {
        	        print all_size[i], all_count[i], i;
				}
        }
} ' | 
# numeric sort - biggest numbers first
# sort fields 0 and 1 first (sort starts with 0)
# followed by dictionary sort on fields 2 + 3
sort +0nr -2 +2d | 
# add header
(echo "size count user group";cat -) |
# convert space to tab - makes it nice output
# the second set of quotes contains a single tab character
tr ' ' '	' 
# done - I hope you like it

note

note:虽然是脚本，但是防错，注释，明确初始值等等好的脚本编程习惯还是挺有用的。

awk 正则表达式 linux系统