POSIX.2 Regular Expressions Explained (the only way to be certain that a regular expression works is to test it)

The syntax of regular expressions in Linux/Bash (POSIX.2 regular expressions) is documented in man 7 regex.

Advanced Bash-Scripting Guide: 18.1. A Brief Introduction to Regular Expressions says:
An expression is a string of characters. Those characters having an interpretation above and beyond their literal meaning are called metacharacters.

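A regular expression is a special kind of string. Some characters in it carry a meaning beyond their literal one; these characters are called metacharacters.

How well a given command supports regular expressions often depends on its particular implementation, hence the saying: "The only way to be certain that a regular expression works is to test it."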

Advanced Bash-Scripting Guide: 18.1. A Brief Introduction to Regular Expressions says:
The only way to be certain that a particular RE works is to test it.
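One low-effort way to act on that advice is to wrap grep in a tiny helper and try a pattern against strings whose expected result is already known. The sketch below is an illustration only: re_matches is a hypothetical helper name, and it assumes a grep that supports the POSIX options -q, -x, -E and -e.

```shell
# re_matches ERE STRING: succeed when STRING as a whole matches the ERE.
# Hypothetical helper for illustration; it relies only on POSIX grep options.
re_matches() {
    printf '%s\n' "$2" | grep -q -x -E -e "$1"
}

# Try the pattern against one string that should match and one that should not.
if re_matches '[0-9]+' '2024'; then r_digits='match'; else r_digits='no match'; fi
if re_matches '[0-9]+' '20x4'; then r_mixed='match'; else r_mixed='no match'; fi
printf '%s\n' "2024: $r_digits" "20x4: $r_mixed"
```

Because -x requires the whole line to match, the helper answers "does this string match the RE?" rather than "does it contain a match?".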

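The rest of this post walks through POSIX.2 regular expressions as described in man 7 regex.

POSIX.2 regular expressions (RE for short) come in two forms:

One is the modern RE, also called the extended RE, such as those supported by the egrep command;

the other is the obsolete RE, also called the basic RE, such as those supported by the ed command.

man 7 regex says: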
Regular expressions (‘‘RE’’s), as defined in POSIX.2, come in two forms: modern REs (roughly those of egrep; 1003.2 calls these ‘‘extended’’ REs) and obsolete REs (roughly those of ed(1); 1003.2 ‘‘basic’’ REs). Obsolete REs mostly exist for backward compatibility in some old programs; they will be discussed at the end. 1003.2 leaves some aspects of RE syntax and semantics open; ‘(!)’ marks decisions on these aspects that may not be fully portable to other 1003.2 implementations.
  

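What follows describes modern REs. A modern RE consists of one or more branches separated by a vertical bar (|). A string matches the RE as soon as it matches any one of the branches.

For example, abc|def matches either abc or def.

man 7 regex says: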
A (modern) RE is one(!) or more non-empty(!) branches, separated by ‘|’. It matches anything that matches one of the branches.
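The alternation rule can be tried directly with grep -E, which selects extended (modern) REs. This is a minimal sketch; the sample input lines and the variable name alt_hits are made up for illustration.

```shell
# Keep only the lines that contain a match for either branch of abc|def.
alt_hits=$(printf '%s\n' abc def xyz abcdef | grep -E 'abc|def')
printf '%s\n' "$alt_hits"
```

The line xyz is dropped because it matches neither branch; abcdef is kept because it contains a match for a branch.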

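A branch is one or more pieces concatenated together; a piece is an atom, optionally followed by a modifier. The pieces are matched one after another, in order.

A modifier specifies how many times the atom may occur, for example:

*  the preceding atom occurs 0 or more times;

+  the preceding atom occurs 1 or more times;

?  the preceding atom occurs 0 or 1 times;

man 7 regex says: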
A branch is one(!) or more pieces, concatenated. It matches a match for the first, followed by a match for the second, etc.

A piece is an atom possibly followed by a single(!) ‘*’, ‘+’, ‘?’, or bound. An atom followed by ‘*’ matches a sequence of 0 or more matches of the atom. An atom followed by ‘+’ matches a sequence of 1 or more matches of the atom. An atom followed by ‘?’ matches a sequence of 0 or 1 matches of the atom.
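The three modifiers can be compared side by side by running the same input through three patterns. A small sketch, assuming a POSIX grep; the inputs and variable names are illustrative only.

```shell
# -x makes grep accept a line only when the whole line matches the RE.
star_hits=$(printf '%s\n' ac abc abbc | grep -x -E 'ab*c')   # b zero or more times
plus_hits=$(printf '%s\n' ac abc abbc | grep -x -E 'ab+c')   # b one or more times
opt_hits=$(printf '%s\n' ac abc abbc | grep -x -E 'ab?c')    # b zero or one time

# The variables are left unquoted on purpose so the lines are joined by spaces.
echo "ab*c :" $star_hits
echo "ab+c :" $plus_hits
echo "ab?c :" $opt_hits
```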
 

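The modifier can also be a bound ({}), which gives the allowed number of occurrences of the preceding atom inside the braces:

{n}  the preceding atom occurs exactly n times; n must lie between 0 and RE_DUP_MAX inclusive, where RE_DUP_MAX is 255 in the implementation this man page describes;

{n,}  the preceding atom occurs n or more times;

{n,m}  the preceding atom occurs between n and m times inclusive; n <= m is required.

man 7 regex says: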
A bound is ‘{’ followed by an unsigned decimal integer, possibly followed by ‘,’ possibly followed by another unsigned decimal integer, always followed by ‘}’. The integers must lie between 0 and RE_DUP_MAX (255(!)) inclusive, and if there are two of them, the first may not exceed the second. An atom followed by a bound containing one integer i and no comma matches a sequence of exactly i matches of the atom. An atom followed by a bound containing one integer i and a comma matches a sequence of i or more matches of the atom. An atom followed by a bound containing two integers i and j matches a sequence of i through j (inclusive) matches of the atom.
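The three bound forms can be checked the same way. The sketch below feeds lines of one to four a characters through each form; the inputs and variable names are illustrative assumptions.

```shell
# Whole-line matching (-x) against a, aa, aaa and aaaa.
exact_hits=$(printf '%s\n' a aa aaa aaaa | grep -x -E 'a{2}')     # exactly 2
atleast_hits=$(printf '%s\n' a aa aaa aaaa | grep -x -E 'a{2,}')  # 2 or more
range_hits=$(printf '%s\n' a aa aaa aaaa | grep -x -E 'a{2,3}')   # 2 through 3

# Unquoted on purpose: joins the matching lines with spaces.
echo "a{2}   :" $exact_hits
echo "a{2,}  :" $atleast_hits
echo "a{2,3} :" $range_hits
```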
 

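Next, what counts as an atom. An atom is one of the following:

(RE)               a parenthesized regular expression (subexpression); matches whatever the RE matches

()                 matches the null string

[CHAR-SET]         matches any single character from the given set

.                  matches any single character

^                  matches the beginning of a line

$                  matches the end of a line

\ followed by one of ^.[$()|*+?{\    escape: the metacharacter loses its special meaning and matches itself

\ followed by any other character    matches that character itself

any other single character           matches that character itself

{ followed by a non-digit            here { is an ordinary character

a trailing \                         illegal

man 7 regex says: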
An atom is a regular expression enclosed in ‘()’ (matching a match for the regular expression), an empty set of ‘()’ (matching the null string)(!), a bracket expression (see below), ‘.’ (matching any single character), ‘^’ (matching the null string at the beginning of a line), ‘$’ (matching the null string at the end of a line), a ‘\’ followed by one of the characters ‘^.[$()|*+?{\’ (matching that character taken as an ordinary character), a ‘\’ followed by any other character(!) (matching that character taken as an ordinary character, as if the ‘\’ had not been present(!)), or a single character with no other significance (matching that character). A ‘{’ followed by a character other than a digit is an ordinary character, not the beginning of a bound(!). It is illegal to end an RE with ‘\’.
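The difference between a metacharacter and its escaped form is easiest to see with the dot. A hedged sketch using grep -E; the three sample lines and the variable names are invented for illustration.

```shell
# An unescaped dot matches any single character; \. matches only a literal dot.
any_hits=$(printf '%s\n' 'a.c' 'abc' 'xa.c' | grep -E 'a.c')
dot_hits=$(printf '%s\n' 'a.c' 'abc' 'xa.c' | grep -E 'a\.c')
# ^ and $ pin the match to the beginning and the end of the line.
anchored_hits=$(printf '%s\n' 'a.c' 'abc' 'xa.c' | grep -E '^a\.c$')

# Unquoted on purpose: joins the matching lines with spaces.
echo "a.c     :" $any_hits
echo "a\\.c    :" $dot_hits
echo "^a\\.c\$  :" $anchored_hits
```

Only the escaped, anchored pattern narrows the result down to the single line a.c.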
 

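Square brackets [ ] enclose a set of characters; the set may not be empty.

If the set begins with ^, the expression matches any single character that is not in the set.

A form like a-z specifies a range of characters, but a form like a-c-e is illegal.

For example, [0-9] matches any digit character and [^0-9] matches any non-digit character.

man 7 regex says: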
A bracket expression is a list of characters enclosed in ‘[]’. It normally matches any single character from the list (but see below). If the list begins with ‘^’, it matches any single character (but see below) not from the rest of the list. If two characters in the list are separated by ‘-’, this is shorthand for the full range of characters between those two (inclusive) in the collating sequence, e.g. ‘[0-9]’ in ASCII matches any decimal digit. It is illegal(!) for two ranges to share an endpoint, e.g. ‘a-c-e’. Ranges are very collating-sequence-dependent, and portable programs should avoid relying on them.
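A range and its negation can be tried as follows. Since ranges depend on the collating sequence, this sketch sets LC_ALL=C so that 0-9 means the ASCII digits; the inputs and variable names are illustrative assumptions.

```shell
# LC_ALL=C pins the collating sequence so that 0-9 covers exactly the ASCII digits.
digit_hits=$(printf '%s\n' 123 abc 12a | LC_ALL=C grep -x -E '[0-9]+')
nondigit_hits=$(printf '%s\n' 123 abc 12a | LC_ALL=C grep -x -E '[^0-9]+')

printf '%s\n' "[0-9]+  : $digit_hits" "[^0-9]+ : $nondigit_hits"
```

The mixed line 12a is rejected by both patterns, because -x requires every character of the line to belong to the same set.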
 

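Inside [ ]:

To include ] in the set, write []], i.e. make ] the first character of the set; [^]] matches any character other than ].

To include - in the set, put it in the first or last position (it may also be the second endpoint of a range); [-] matches -, and [^-] matches any character other than -.

man 7 regex says: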
To include a literal ‘]’ in the list, make it the first character (following a possible ‘^’). To include a literal ‘-’, make it the first or last character, or the second endpoint of a range. To use a literal ‘-’ as the first endpoint of a range, enclose it in ‘[.’ and ‘.]’ to make it a collating element (see below). With the exception of these and some combinations using ‘[’ (see next paragraphs), all other special characters, including ‘\’, lose their special significance within a bracket expression.
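The placement rules for ] and - inside a bracket expression can be verified with three one-character lines. This is only a sketch; the inputs and variable names are made up for the example.

```shell
# Input lines: a right bracket, a hyphen and the letter a.
rb_hits=$(printf '%s\n' ']' '-' 'a' | grep -x -E '[]]')       # ] placed first is literal
not_rb_hits=$(printf '%s\n' ']' '-' 'a' | grep -x -E '[^]]')  # any character except ]
dash_hits=$(printf '%s\n' ']' '-' 'a' | grep -x -E '[a-]')    # - placed last is literal

printf '%s\n' "$rb_hits"
printf '%s\n' "$not_rb_hits"
printf '%s\n' "$dash_hits"
```

The second and third patterns both select the lines - and a, for different reasons: the second excludes ], the third lists a and - explicitly.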
 

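Multi-character sequences (collating elements) are written [.chars.]; for example, [[.ch.]] can match ch, provided the locale's collating sequence defines ch as a collating element.

man 7 regex says: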
Within a bracket expression, a collating element (a character, a multi-character sequence that collates as if it were a single character, or a collating-sequence name for either) enclosed in ‘[.’ and ‘.]’ stands for the sequence of characters of that collating element. The sequence is a single element of the bracket expression’s list. A bracket expression containing a multi-character collating element can thus match more than one character, e.g. if the collating sequence includes a ‘ch’ collating element, then the RE ‘[[.ch.]]*c’ matches the first five characters of ‘chchcc’.
 

Equivalence classes are written [=c=], though I have not yet figured out how to use them in practice.

man 7 regex says:
Within a bracket expression, a collating element enclosed in ‘[=’ and ‘=]’ is an equivalence class, standing for the sequences of characters of all collating elements equivalent to that one, including itself. (If there are no other equivalent collating elements, the treatment is as if the enclosing delimiters were ‘[.’ and ‘.]’.) For example, if o and ô are the members of an equivalence class, then ‘[[=o=]]’, ‘[[=ô=]]’, and ‘[oô]’ are all synonymous. An equivalence class may not(!) be an endpoint of a range.
 

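Inside [ ], a character class can be given in the form [:class:]; for example, [[:digit:]] matches a digit and [[:alpha:]] matches a letter. Commonly used standard character classes are:

alnum    letters and digits

alpha    letters

blank    blanks: space and tab

digit    digits

lower    lowercase letters

space    whitespace: space, tab, vertical tab, newline, carriage return and form feed; note the difference from the blank class

upper    uppercase letters

xdigit   hexadecimal digit characters

These classes are tested the same way as the character classes in C; for example, C uses isalpha(c) to test for a letter, and so on.

man 7 regex says: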
Within a bracket expression, the name of a character class enclosed in ‘[:’ and ‘:]’ stands for the list of all characters belonging to that class. Standard character class names are:

alnum   digit   punct
alpha   graph   space
blank   lower   upper
cntrl   print   xdigit

These stand for the character classes defined in wctype(3). A locale may provide others. A character class may not be used as an endpoint of a range.
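A few of the standard classes can be exercised with grep -E. The sketch below sets LC_ALL=C so that class membership is the ASCII one; the sample lines and variable names are illustrative assumptions.

```shell
# Input lines: letters only, digits only, a mix, and letters separated by a space.
digit_hits=$(printf '%s\n' 'abc' '123' 'a1' 'a b' | LC_ALL=C grep -x -E '[[:digit:]]+')
alpha_hits=$(printf '%s\n' 'abc' '123' 'a1' 'a b' | LC_ALL=C grep -x -E '[[:alpha:]]+')
alnum_hits=$(printf '%s\n' 'abc' '123' 'a1' 'a b' | LC_ALL=C grep -x -E '[[:alnum:]]+')
# Several classes can be combined inside one bracket expression.
words_hits=$(printf '%s\n' 'abc' '123' 'a1' 'a b' | LC_ALL=C grep -x -E '[[:alpha:][:blank:]]+')

printf '%s\n' "$digit_hits" "$alpha_hits" "$alnum_hits" "$words_hits"
```

Note the doubled brackets: [:digit:] is only meaningful inside a bracket expression, so the complete atom is [[:digit:]].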
 

The C functions that test character classes are described as follows.

man 3 isalpha says:
isalnum()
       checks for an alphanumeric character; it is equivalent to (isalpha(c) || isdigit(c)).

isalpha()
       checks for an alphabetic character; in the standard "C" locale, it is equivalent to (isupper(c) || islower(c)). In some locales, there may be additional characters for which isalpha() is true—letters which are neither upper case nor lower case.

isascii()
       checks whether c is a 7-bit unsigned char value that fits into the ASCII character set.

isblank()
       checks for a blank character; that is, a space or a tab.

iscntrl()
       checks for a control character.

isdigit()
       checks for a digit (0 through 9).

isgraph()
       checks for any printable character except space.

islower()
       checks for a lower-case character.

isprint()
       checks for any printable character including space.

ispunct()
       checks for any printable character which is not a space or an alphanumeric character.

isspace()
       checks for white-space characters. In the "C" and "POSIX" locales, these are: space, form-feed ('\f'), newline ('\n'), carriage return ('\r'), horizontal tab ('\t'), and vertical tab ('\v').

isupper()
       checks for an uppercase letter.

isxdigit()
       checks for hexadecimal digits, i.e. one of 0 1 2 3 4 5 6 7 8 9 a b c d e f A B C D E F.

 

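Note that POSIX.2 regular expressions do not support the shorthand character classes found in Java; for example, in Java \d matches a digit and \w matches a letter, digit or underscore.

A regular expression matches starting at the earliest possible position in the string and extends to the longest possible match from there: the longer the match the better, i.e. matching is greedy.

man 7 regex says: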
In the event that an RE could match more than one substring of a given string, the RE matches the one starting earliest in the string. If the RE could match more than one substring starting at that point, it matches the longest. Subexpressions also match the longest possible substrings, subject to the constraint that the whole match be as long as possible, with subexpressions starting earlier in the RE taking priority over ones starting later. Note that higher-level subexpressions thus take priority over their lower-level component subexpressions.
 

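Match length is counted in characters. Even a match of the null string is considered longer than no match at all. For example:

bb* matches the middle three characters of abbbc;

(wee|week)(knights|nights) matches the whole of weeknights;

(.*).* matches abc, where (.*) matches abc and the remaining .* matches the null string;

(a*)* matched against bc: both (a*)* and (a*) match only the null string.

man 7 regex says: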
Match lengths are measured in characters, not collating elements. A null string is considered longer than no match at all. For example, ‘bb*’ matches the three middle characters of ‘abbbc’, ‘(wee|week)(knights|nights)’ matches all ten characters of ‘weeknights’, when ‘(.*).*’ is matched against ‘abc’ the parenthesized subexpression matches all three characters, and when ‘(a*)*’ is matched against ‘bc’ both the whole RE and the parenthesized subexpression match the null string.
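The leftmost-longest rule can be observed by printing just the matched text. The sketch below assumes a grep with the -o option and a sed with the -E option (both are available in the GNU and BSD tools, but -o is not required by POSIX); the variable names are illustrative.

```shell
# grep -o prints only the part of the line that matched.
m_run=$(printf 'abbbc\n' | grep -o -E 'bb*')
m_week=$(printf 'weeknights\n' | grep -o -E '(wee|week)(knights|nights)')
# The longer branch wins even though the shorter one is listed first.
m_alt=$(printf 'week\n' | grep -o -E 'wee|week')
# sed -E exposes what the first parenthesized subexpression matched.
g_first=$(printf 'abc\n' | sed -E 's/(.*).*/[\1]/')

printf '%s\n' "$m_run" "$m_week" "$m_alt" "$g_first"
```

The third case is where POSIX matching differs from engines that try alternatives in order and stop at the first success, which would report wee.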
 

On case-insensitive matching: x matches both x and X, as if it were [xX], and [^x] is treated as [^xX].

man 7 regex says:
If case-independent matching is specified, the effect is much as if all case distinctions had vanished from the alphabet. When an alphabetic that exists in multiple cases appears as an ordinary character outside a bracket expression, it is effectively transformed into a bracket expression containing both cases, e.g. ‘x’ becomes ‘[xX]’. When it appears inside a bracket expression, all case counterparts of it are added to the bracket expression, so that (e.g.) ‘[x]’ becomes ‘[xX]’ and ‘[^x]’ becomes ‘[^xX]’.
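With grep, case-independent matching is requested with the -i option. A small sketch of both cases described above; the inputs and variable names are assumptions made for the example.

```shell
# With -i, the pattern x behaves like [xX] ...
ci_hits=$(printf '%s\n' x X y | grep -i -x -E 'x')
# ... and [^x] behaves like [^xX], so neither x nor X is accepted.
ci_neg_hits=$(printf '%s\n' x X y | grep -i -x -E '[^x]')

# Unquoted on purpose: joins the matching lines with spaces.
echo "x    :" $ci_hits
echo "[^x] :" $ci_neg_hits
```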
 

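On the length of regular expressions: the man page imposes no particular limit, but portable programs should keep REs to at most 256 bytes, since a POSIX-compliant implementation may refuse longer ones.

man 7 regex says: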
No particular limit is imposed on the length of REs(!). Programs intended to be portable should not employ REs longer than 256 bytes, as an implementation can refuse to accept such REs and remain POSIX-compliant.
 

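Finally, how obsolete REs (basic REs) differ from the modern REs (extended REs) described above:

In a basic RE, the vertical bar (|), plus sign (+) and question mark (?) are ordinary characters;

bounds are written with \{ \}, while { } are just ordinary characters;

subexpressions are written with \( \), while ( ) are just ordinary characters;

^ is an ordinary character unless it is at the beginning of the RE or of a parenthesized subexpression, $ is an ordinary character unless it is at the end of the RE or of a parenthesized subexpression, and * is an ordinary character when it appears at the beginning;

\ followed by a non-zero digit is a back reference to the substring matched by the corresponding earlier subexpression; for example, \([bc]\)\1 matches bb or cc but not bc.

man 7 regex says: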
Obsolete (‘‘basic’’) regular expressions differ in several respects. ‘|’, ‘+’, and ‘?’ are ordinary characters and there is no equivalent for their functionality. The delimiters for bounds are ‘\{’ and ‘\}’, with ‘{’ and ‘}’ by themselves ordinary characters. The parentheses for nested subexpressions are ‘\(’ and ‘\)’, with ‘(’ and ‘)’ by themselves ordinary characters. ‘^’ is an ordinary character except at the beginning of the RE or(!) the beginning of a parenthesized subexpression, ‘$’ is an ordinary character except at the end of the RE or(!) the end of a parenthesized subexpression, and ‘*’ is an ordinary character if it appears at the beginning of the RE or the beginning of a parenthesized subexpression (after a possible leading ‘^’). Finally, there is one new type of atom, a back reference: ‘\’ followed by a non-zero decimal digit d matches the same sequence of characters matched by the dth parenthesized subexpression (numbering subexpressions by the positions of their opening parentheses, left to right), so that (e.g.) ‘\([bc]\)\1’ matches ‘bb’ or ‘cc’ but not ‘bc’.
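The same differences show up when grep is run without -E, in which case it treats the pattern as a basic RE. This is a sketch under that assumption; the inputs and variable names are invented for illustration.

```shell
# Back reference: \1 must repeat exactly what \([bc]\) matched.
backref_hits=$(printf '%s\n' bb cc bc | grep -x '\([bc]\)\1')

# In a basic RE + is an ordinary character; in an extended RE it is a modifier.
bre_plus_hits=$(printf '%s\n' 'a+b' 'aab' | grep -x 'a+b')
ere_plus_hits=$(printf '%s\n' 'a+b' 'aab' | grep -x -E 'a+b')

# In a basic RE bounds need \{ \}; bare braces are ordinary characters.
bre_bound_hits=$(printf '%s\n' 'aa' 'a{2}' | grep -x 'a\{2\}')
bre_brace_hits=$(printf '%s\n' 'aa' 'a{2}' | grep -x 'a{2}')

# $backref_hits is unquoted on purpose so its lines are joined by spaces.
echo "back reference :" $backref_hits
printf '%s\n' "$bre_plus_hits" "$ere_plus_hits" "$bre_bound_hits" "$bre_brace_hits"
```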
 

To repeat what was said earlier: "The only way to be certain that a regular expression works is to test it."

Advanced Bash-Scripting Guide: 18.1. A Brief Introduction to Regular Expressions says:
The only way to be certain that a particular RE works is to test it.