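I am starting to feel that using regular expressions decreases code maintainability. There is something evil about the terseness and power of regular expressions. Perl compounds this with side effects such as default operators.
I do have a habit of documenting regular expressions with at least one sentence giving the basic intent and at least one example of what would match.
Because regular expressions are built up from parts, I feel it is an absolute necessity to comment on the largest components of each element in the expression. Despite this, even my own regular expressions leave me scratching my head as though I were reading Klingon.
Do you intentionally dumb down your regular expressions? Do you decompose possibly shorter and more powerful ones into simpler steps? I have given up on nesting regular expressions. Are there regular expression constructs that you avoid because of maintainability issues?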
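Do not let this example cloud the question.
If the following regex by Michael Ash had some sort of bug in it, would you have any prospect of doing anything other than throwing it away entirely?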
^(?:(?:(?:0?[13578]|1[02])(\/|-|\.)31)\1|(?:(?:0?[13-9]|1[0-2])(\/|-|\.)(?:29|30)\2))(?:(?:1[6-9]|[2-9]\d)?\d{2})$|^(?:0?2(\/|-|\.)29\3(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00))))$|^(?:(?:0?[1-9])|(?:1[0-2]))(\/|-|\.)(?:0?[1-9]|1\d|2[0-8])\4(?:(?:1[6-9]|[2-9]\d)?\d{2})$
As requested, the exact purpose can be found using Mr. Ash's link above.
Matches: 01.1.02 | 11-30-2001 | 2/29/2000
Non-matches: 02/29/01 | 13/01/2002 | 11/00/02
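One way to make the documented examples earn their keep is to turn them into executable checks. The sketch below uses Python, which is an assumption, since the thread itself uses Perl and .NET; the names `ASH_DATE` and `is_ash_date` are hypothetical. It runs the quoted pattern against the listed matches and non-matches.

```python
import re

# The pattern quoted above, split across adjacent raw-string literals only so
# that it fits on the page; Python joins them back into one single pattern.
ASH_DATE = re.compile(
    r"^(?:(?:(?:0?[13578]|1[02])(\/|-|\.)31)\1|"
    r"(?:(?:0?[13-9]|1[0-2])(\/|-|\.)(?:29|30)\2))"
    r"(?:(?:1[6-9]|[2-9]\d)?\d{2})$"
    r"|^(?:0?2(\/|-|\.)29\3(?:(?:(?:1[6-9]|[2-9]\d)?"
    r"(?:0[48]|[2468][048]|[13579][26])|"
    r"(?:(?:16|[2468][048]|[3579][26])00))))$"
    r"|^(?:(?:0?[1-9])|(?:1[0-2]))(\/|-|\.)"
    r"(?:0?[1-9]|1\d|2[0-8])\4(?:(?:1[6-9]|[2-9]\d)?\d{2})$"
)


def is_ash_date(text: str) -> bool:
    """Return True if text matches the m/d/y pattern quoted above."""
    return ASH_DATE.match(text) is not None


# The examples from the question, kept next to the pattern as live checks.
SHOULD_MATCH = ["01.1.02", "11-30-2001", "2/29/2000"]
SHOULD_NOT_MATCH = ["02/29/01", "13/01/2002", "11/00/02"]

for example in SHOULD_MATCH:
    assert is_ash_date(example), example
for example in SHOULD_NOT_MATCH:
    assert not is_ash_date(example), example
```

If someone later edits the pattern and breaks one of the documented cases, the assertion names the failing example instead of leaving the reader to decode the regex.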
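Use Expresso, which gives a hierarchical, English breakdown of a regex.
Or
This tip from Darren Neimke:
.NET allows regular expression patterns to be authored with embedded comments via the RegexOptions.IgnorePatternWhitespace compiler option and the (?#...) syntax embedded within each line of the pattern string.
This allows pseudo-code-like comments to be embedded in each line and has the following effect on readability: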
    Imports System.Text.RegularExpressions

    ' The literal # is escaped because, under IgnorePatternWhitespace,
    ' an unescaped # starts a comment that runs to the end of the line.
    Dim re As New Regex( _
        "(?<=     (?# Start a positive lookBEHIND assertion ) " & _
        "(\#|@)   (?# Find a # or a @ symbol ) " & _
        ")        (?# End the lookBEHIND assertion ) " & _
        "(?=      (?# Start a positive lookAHEAD assertion ) " & _
        "   \w+   (?# Find at least one word character ) " & _
        ")        (?# End the lookAHEAD assertion ) " & _
        "\w+\b    (?# Match multiple word characters leading up to a word boundary)", _
        RegexOptions.Multiline Or RegexOptions.IgnoreCase Or RegexOptions.IgnorePatternWhitespace _
    )
Here is another .NET example (requires the RegexOptions.Multiline and RegexOptions.IgnorePatternWhitespace options):
    static string validEmail = @"\b    # Find a word boundary
        (?<Username>           # Begin group: Username
        [a-zA-Z0-9._%+-]+      #   Characters allowed in username, 1 or more
        )                      # End group: Username
        @                      # The e-mail '@' character
        (?<Domainname>         # Begin group: Domain name
        [a-zA-Z0-9.-]+         #   Domain name(s), we include a dot so that
                               #   mail.somewhere is also possible
        \.[a-zA-Z]{2,4}        #   The top level domain can be at most 4 characters
                               #   So .info works, .telephone doesn't.
        )                      # End group: Domain name
        \b                     # Ending on a word boundary
        ";
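If your regex applies to a common problem, another option is to document it and submit it to RegExLib, where it will be rated and commented on. Nothing beats many pairs of eyes...
Another regex tool is The Regulator.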
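I usually just try to wrap all my regular expression calls inside their own function, with a meaningful name and some basic comments. I like to think of regular expressions as a write-only language, readable only by the person who wrote them (unless they are really simple). I fully expect that someone who has to change an expression's intent will probably need to rewrite it completely, and that is probably for the better, since it keeps everyone's regular expression skills in practice.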
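A minimal sketch of that habit, in Python (an assumption, as the answer names no language). The function name `looks_like_iso_date` and the pattern are hypothetical, chosen only to show a regex hidden behind a descriptive name with its intent and one example in the docstring.

```python
import re

# Compiled once at import time; callers never see the pattern itself.
_ISO_DATE_SHAPE = re.compile(r"\d{4}-\d{2}-\d{2}")


def looks_like_iso_date(text: str) -> bool:
    """Return True if text has the shape YYYY-MM-DD, e.g. "2008-11-30".

    Only the shape is checked, not whether the date exists:
    "2008-13-99" passes too.
    """
    return _ISO_DATE_SHAPE.fullmatch(text) is not None
```

Call sites then read as `if looks_like_iso_date(value):`, and whoever has to rewrite the expression later only needs to keep the docstring's promise.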
Well, the PCRE /x modifier's entire purpose in life is to let you write regexes more readably, as in this trivial example:
    my $expr = qr/
        [a-z]    # match a lower-case letter
        \d{3,5}  # followed by 3-5 digits
    /x;
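Some people use REs for the wrong things (I'm waiting for the first SO question on how to detect a valid C++ program using a single RE).
I usually find that, if I can't fit my RE within 60 characters, it's better off being a piece of code, since that will almost always be more readable.
In any case, I always document in the code, in great detail, what the RE is supposed to achieve. This is because I know, from bitter experience, how hard it is for someone else (or even me, six months later) to come in and try to understand it.
I don't believe they're evil, although I do believe some people who use them are evil (not looking at you, Michael Ash :-). They're a great tool but, like a chainsaw, you'll cut your legs off if you don't know how to use them properly.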
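UPDATE: Actually, I've just followed the link to that monstrosity, and it's meant to validate m/d/y format dates between the years 1600 and 9999. That's a classic case where full-blown code would be more readable and maintainable.
You just split it into three fields and check the individual values. I'd almost consider it an offence worthy of termination if one of my minions brought this to me. I'd certainly send them back to write it properly.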
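The "split and check" approach could look roughly like the sketch below, in Python (an assumption, since this answer names no language). The names `is_valid_mdy`, `is_leap_year` and `DAYS_IN_MONTH` are hypothetical. Two-digit years are treated as 19yy, which mirrors the Perl test script later in this thread rather than anything stated on Mr. Ash's page.

```python
# Maximum day for each month, with February allowed 29 (leap check is separate).
DAYS_IN_MONTH = [31, 29, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]


def is_leap_year(year: int) -> bool:
    """Gregorian rule: divisible by 4, except centuries not divisible by 400."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


def is_valid_mdy(text: str) -> bool:
    """Validate an m/d/y date that uses '/', '-' or '.' as its separator."""
    # Split on the first kind of separator present; demanding exactly three
    # all-digit parts also rejects mixed separators such as "1/2-03".
    for sep in "/-.":
        if sep in text:
            parts = text.split(sep)
            break
    else:
        return False
    if len(parts) != 3:
        return False
    if not all(p.isascii() and p.isdigit() for p in parts):
        return False

    month_s, day_s, year_s = parts
    if len(month_s) > 2 or len(day_s) > 2 or len(year_s) not in (2, 4):
        return False

    month, day, year = int(month_s), int(day_s), int(year_s)
    if len(year_s) == 2:
        year += 1900  # assumption: two-digit years mean 19yy
    elif year < 1600:
        return False  # four-digit years must fall in 1600..9999

    if not 1 <= month <= 12:
        return False
    if not 1 <= day <= DAYS_IN_MONTH[month - 1]:
        return False
    if month == 2 and day == 29 and not is_leap_year(year):
        return False
    return True
```

It is longer than the one-line pattern, but each rule sits on its own line, so a bug in, say, the leap-year handling can be fixed without touching anything else.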
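Here is the same regex broken down into digestible pieces. In addition to being more readable, some of the sub-regexes could be useful on their own. It is also significantly easier to change the allowed separators.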
    #!/usr/local/ActivePerl-5.10/bin/perl

    use 5.010; #only 5.10 and above
    use strict;
    use warnings;

    my $sep         = qr{ [/.-] }x;               #allowed separators
    my $any_century = qr/ 1[6-9] | [2-9][0-9] /x; #match the century
    my $any_decade  = qr/ [0-9]{2} /x;            #match any decade or 2 digit year
    my $any_year    = qr/ $any_century? $any_decade /x; #match a 2 or 4 digit year

    #match the 1st through 28th for any month of any year
    my $start_of_month = qr/
        (?:                         #match
            0?[1-9] |               #Jan - Sep or
            1[0-2]                  #Oct - Dec
        )
        ($sep)                      #the separator
        (?:
            0?[1-9] |               # 1st -  9th or
            1[0-9]  |               #10th - 19th or
            2[0-8]                  #20th - 28th
        )
        \g{-1}                      #and the separator again
    /x;

    #match 29th - 31st for any month but Feb for any year
    my $end_of_month = qr/
        (?:
            (?: 0?[13578] | 1[02] ) #match Jan, Mar, May, Jul, Aug, Oct, Dec
            ($sep)                  #the separator
            31                      #the 31st
            \g{-1}                  #and the separator again
            |                       #or
            (?: 0?[13-9] | 1[0-2] ) #match all months but Feb
            ($sep)                  #the separator
            (?:29|30)               #the 29th or the 30th
            \g{-1}                  #and the separator again
        )
    /x;

    #match any non-leap year date and the first part of Feb in leap years
    my $non_leap_year = qr/ (?: $start_of_month | $end_of_month ) $any_year/x;

    #match 29th of Feb in leap years
    #BUG: 00 is treated as a non leap year
    #even though 2000, 2400, etc are leap years
    my $feb_in_leap = qr/
        0?2                         #match Feb
        ($sep)                      #the separator
        29                          #the 29th
        \g{-1}                      #the separator again
        (?:
            $any_century?           #any century
            (?:                     #and decades divisible by 4 but not 100
                0[48]       |
                [2468][048] |
                [13579][26]
            )
            |
            (?:                     #or match centuries that are divisible by 4
                16          |
                [2468][048] |
                [3579][26]
            )
            00
        )
    /x;

    my $any_date  = qr/$non_leap_year|$feb_in_leap/;
    my $only_date = qr/^$any_date$/;

    say "test against garbage";
    for my $date (qw(022900 foo 1/1/1)) {
        #plain =~ is used instead of smartmatch (~~), which newer perls deprecate
        say "\t$date ", $date =~ $only_date ? "matched" : "didn't match";
    }
    say '';

    #comprehensive test

    my @code = qw/good unmatch month day year leap/;
    for my $sep (qw( / - . )) {
        say "testing $sep";
        my $i = 0;
        for my $y ("00" .. "99", 1600 .. 9999) {
            say "\t", int $i/8500*100, "% done" if $i++ and not $i % 850;
            for my $m ("00" .. "09", 0 .. 13) {
                for my $d ("00" .. "09", 1 .. 31) {
                    my $date = join $sep, $m, $d, $y;
                    my $re   = $date =~ $only_date || 0;
                    my $code = not_valid($date);
                    unless ($re == !$code) {
                        die "error $date re $re code $code[$code]\n"
                    }
                }
            }
        }
    }

    sub not_valid {
        state $end = [undef, 31, 29, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31];
        my $date      = shift;
        my ($m,$d,$y) = $date =~ m{([0-9]+)[-./]([0-9]+)[-./]([0-9]+)};
        return 1 unless defined $m; #if $m is set, the rest will be too
        #components are in roughly the right ranges
        return 2 unless $m >= 1 and $m <= 12;
        return 3 unless $d >= 1 and $d <= $end->[$m];
        return 4 unless ($y >= 0 and $y <= 99) or ($y >= 1600 and $y <= 9999);
        #handle the non leap year case
        return 5 if $m == 2 and $d == 29 and not leap_year($y);

        return 0;
    }

    sub leap_year {
        my $y = shift;
        $y = "19$y" if $y < 1600; #two digit years are treated as 19yy
        return 1 if 0 == $y % 4 and 0 != $y % 100 or 0 == $y % 400;
        return 0;
    }