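I want to reverse a regular expression. That is, given a regular expression, I want to produce any string that matches it.
I know how to do this with a finite state machine, from a theoretical computer science background, but I just want to know whether someone has already written a library to do it. :)
I'm using Python, so I'd like a Python library.
To reiterate, I only want one string that matches the regex. Things like "." or ".*" make an infinite number of strings match the regex, but I don't care about all the options.
I'm fine with the library working only on a certain subset of regex.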
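Someone else had a similar (duplicate?) question here, and I'd like to offer a small helper library for generating random strings with Python that I've been working on.
It includes a method, xeger(), that lets you create a string from a regex: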
    >>> import rstr
    >>> rstr.xeger(r'[A-Z]\d[A-Z] \d[A-Z]\d')
    u'M5R 2W4'
Right now it works with most basic regular expressions, but I'm sure it could be improved.
Although I don't see much sense in this, here goes:
    # Written for Python 2, where sre_parse reports opcodes as plain strings.
    import re
    import string

    def traverse(tree):
        retval = ''
        for node in tree:
            if node[0] == 'any':
                retval += 'x'
            elif node[0] == 'at':
                pass  # anchors consume no characters
            elif node[0] in ['min_repeat', 'max_repeat']:
                # repeat the body the minimum number of times
                retval += traverse(node[1][2]) * node[1][0]
            elif node[0] == 'in':
                if node[1][0][0] == 'negate':
                    # start from all ASCII letters and drop the excluded ones
                    letters = list(string.ascii_letters)
                    for part in node[1][1:]:
                        if part[0] == 'literal':
                            excluded = [chr(part[1])]
                        else:
                            excluded = [chr(c) for c in range(part[1][0], part[1][1]+1)]
                        for letter in excluded:
                            if letter in letters:  # excluded non-letters are not in the list
                                letters.remove(letter)
                    retval += letters[0]
                else:
                    # take the first member of the set
                    if node[1][0][0] == 'range':
                        retval += chr(node[1][0][1][0])
                    else:
                        retval += chr(node[1][0][1])
            elif node[0] == 'not_literal':
                # anything but the excluded character; 120 is ord('x')
                if node[1] == 120:
                    retval += 'y'
                else:
                    retval += 'x'
            elif node[0] == 'branch':
                retval += traverse(node[1][1][0])  # first alternative
            elif node[0] == 'subpattern':
                retval += traverse(node[1][1])
            elif node[0] == 'literal':
                retval += chr(node[1])
        return retval

    # regex is the pattern string to invert
    print traverse(re.sre_parse.parse(regex).data)
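It covers what I took from the regular expression syntax, which seems like a reasonable subset, and I ignored some details such as line endings. Error handling and the like are left as an exercise for the reader.
Of the 12 special characters in a regex, we can ignore 6 completely (2 of them even together with the atom they apply to), 4.5 lead to a trivial replacement, and 1.5 make us actually think.
What comes out of this is not terribly interesting, I think.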
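I don't know of any module that does this. If you don't find anything like it in the Cookbook or on PyPI, you could try rolling your own with the (undocumented) re.sre_parse module. This might help you get started: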
    In [1]: import re

    In [2]: a = re.sre_parse.parse("[abc]+[def]*\d?z")

    In [3]: a
    Out[3]: [('max_repeat', (1, 65535, [('in', [('literal', 97), ('literal', 98), ('literal', 99)])])),
             ('max_repeat', (0, 65535, [('in', [('literal', 100), ('literal', 101), ('literal', 102)])])),
             ('max_repeat', (0, 1, [('in', [('category', 'category_digit')])])),
             ('literal', 122)]

    In [4]: eval(str(a))
    Out[4]: [('max_repeat', (1, 65535, [('in', [('literal', 97), ('literal', 98), ('literal', 99)])])),
             ('max_repeat', (0, 65535, [('in', [('literal', 100), ('literal', 101), ('literal', 102)])])),
             ('max_repeat', (0, 1, [('in', [('category', 'category_digit')])])),
             ('literal', 122)]

    In [5]: a.dump()
    max_repeat 1 65535
      in
        literal 97
        literal 98
        literal 99
    max_repeat 0 65535
      in
        literal 100
        literal 101
        literal 102
    max_repeat 0 1
      in
        category category_digit
    literal 122
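For what it's worth, here is a rough sketch of walking that parse tree under Python 3 to build one matching string. It leans on the same undocumented parser, so treat everything about it as an assumption: the import fallback, the tuple layouts, and the opcode names are what I'd expect from recent CPython releases and may change without notice. The names `generate`, `sample_from_set`, `example_string` and `CATEGORY_SAMPLES` are made up for this example, and negated sets, lookarounds and backreferences are deliberately not handled.

```python
import re

try:  # Python 3.11+ keeps the parser in private submodules of re
    from re import _parser as sre_parse, _constants as sre_constants
except ImportError:  # earlier Python 3 releases
    import sre_parse
    import sre_constants

# One representative character per supported category (an arbitrary choice).
CATEGORY_SAMPLES = {
    sre_constants.CATEGORY_DIGIT: "0",
    sre_constants.CATEGORY_WORD: "a",
    sre_constants.CATEGORY_SPACE: " ",
}


def sample_from_set(items):
    """Return one character from a non-negated character set."""
    op, arg = items[0]
    if op == sre_constants.LITERAL:
        return chr(arg)
    if op == sre_constants.RANGE:
        return chr(arg[0])  # lower bound of the range
    if op == sre_constants.CATEGORY and arg in CATEGORY_SAMPLES:
        return CATEGORY_SAMPLES[arg]
    raise NotImplementedError("unsupported set item: %r" % (op,))


def generate(tree):
    """Build one matching string from a parsed pattern."""
    out = []
    for op, arg in tree.data:
        if op == sre_constants.LITERAL:
            out.append(chr(arg))
        elif op == sre_constants.ANY:
            out.append("x")
        elif op == sre_constants.AT:
            pass  # anchors consume no characters
        elif op in (sre_constants.MAX_REPEAT, sre_constants.MIN_REPEAT):
            low, _high, body = arg
            out.append(generate(body) * low)  # fewest allowed repetitions
        elif op == sre_constants.IN:
            out.append(sample_from_set(arg))
        elif op == sre_constants.BRANCH:
            out.append(generate(arg[1][0]))  # always take the first alternative
        elif op == sre_constants.SUBPATTERN:
            out.append(generate(arg[-1]))  # the group body is the last field
        else:
            raise NotImplementedError("unsupported opcode: %r" % (op,))
    return "".join(out)


def example_string(pattern):
    """Parse the pattern and return one string that should match it."""
    return generate(sre_parse.parse(pattern))
```

A cheap sanity check is to feed the result back through the real engine, e.g. `re.fullmatch(p, example_string(p))` should not be `None` for any pattern `p` the sketch accepts.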
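Unless your regex is extremely simple (i.e. no stars or pluses), there will be infinitely many strings that match it. If your regex involves only concatenation and alternation, you can expand each alternation into all of its possibilities, e.g. (foo|bar)(baz|quux) can be expanded into the list ['foobaz', 'fooquux', 'barbaz', 'barquux'].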
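The expansion described above is just a Cartesian product over the alternatives in each group. A minimal sketch, assuming the pattern contains only literal text and flat, non-nested (a|b|c) groups with no other metacharacters; `expand_alternations` is a name invented for this example:

```python
import itertools
import re


def expand_alternations(pattern):
    """Expand literal text and flat (a|b|c) groups into every matching string."""
    choices = []
    # Each token is either a parenthesised group or a run of literal text.
    for m in re.finditer(r"\(([^()]*)\)|([^()|]+)", pattern):
        if m.group(2) is not None:
            choices.append([m.group(2)])           # literal run: a single choice
        else:
            choices.append(m.group(1).split("|"))  # group: one choice per alternative
    # The Cartesian product picks one alternative from every position.
    return ["".join(parts) for parts in itertools.product(*choices)]


print(expand_alternations("(foo|bar)(baz|quux)"))
# ['foobaz', 'fooquux', 'barbaz', 'barquux']
```

Anything with nesting, top-level alternation or quantifiers needs a real parser instead of this regex-based tokenizer.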
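While the other answers use the re engine to parse out the elements, I whipped up my own parser that reads the RE and returns a minimal string that would match it. (Note that it doesn't handle [^ads], fancy grouping constructs, or the start/end-of-line special characters.) I can supply unit tests if you really like :)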
    import re

    class REParser(object):
        """Parses an RE and gives the least greedy value that would match it"""

        def parse(self, parseInput):
            re.compile(parseInput)  # try to compile to see if it is a valid RE
            retval = ""
            stack = list(parseInput)
            lastelement = ""
            while stack:
                element = stack.pop(0)  # read from the front
                if element == "\\":
                    element = stack.pop(0)
                    element = element.replace("d", "0").replace("D", "a").replace("w", "a").replace("W", " ")
                elif element in ["?", "*"]:
                    lastelement = ""  # zero repetitions is enough
                    element = ""
                elif element == ".":
                    element = "a"
                elif element == "+":
                    element = ""  # one repetition: keep lastelement as it is
                elif element == "{":
                    arg = self._consumeTo(stack, "}")
                    arg = arg[:-1]  # dump the }
                    arg = arg.split(",")[0]  # dump the possible , and upper bound
                    lastelement = lastelement * int(arg or 0)  # {,n} means a minimum of 0
                    element = ""
                elif element == "[":
                    element = self._consumeTo(stack, "]")[0]  # just use the first char in the set
                    if element == "]":  # this is the odd case of []]
                        self._consumeTo(stack, "]")  # throw the rest away and use ] as the element
                elif element == "|":
                    break  # once you reach a | you have all you need to match
                elif element == "(":
                    arg = self._consumeTo(stack, ")")
                    element = self.parse(arg[:-1])
                retval += lastelement
                lastelement = element
            retval += lastelement  # complete the string with the last element
            return retval

        def _consumeTo(self, stackToConsume, endElement):
            retval = ""
            while not retval.endswith(endElement):
                retval += stackToConsume.pop(0)
            return retval