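I need to perform a simple task in Python: convert a string to all lowercase and strip out every character that is not an ASCII letter.
For example:

    "This is a Test" -> "thisisatest"
    "A235th@#$&( er Ra{}|?>ndom" -> "atherrandom"

I have a simple function to do this: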
    import string
    import sys

    def strip_string_to_lowercase(s):
        tmpStr = s.lower().strip()
        retStrList = []
        for x in tmpStr:
            if x in string.ascii_lowercase:
                retStrList.append(x)
        return ''.join(retStrList)
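But I can't help thinking there is a more efficient, or more elegant, way.
Thanks!
Edit:
Thanks to everyone who answered. I learned, and in some cases re-learned, a good deal of Python.
Another solution (not that pythonic, but very fast) is to use string.translate - though note that this will not work for unicode. It's also worth noting that you can speed up Dana's code by moving the characters into a set (which looks up by hash, rather than performing a linear search each time). Here are the timings I get for the various solutions given: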
    import string, re, timeit

    # Precomputed values (for str_join_set and translate)
    letter_set = frozenset(string.ascii_lowercase + string.ascii_uppercase)
    tab = string.maketrans(string.ascii_lowercase + string.ascii_uppercase,
                           string.ascii_lowercase * 2)
    deletions = ''.join(ch for ch in map(chr,range(256)) if ch not in letter_set)

    s="A235th@#$&( er Ra{}|?>ndom"

    # From unwind's filter approach
    def test_filter(s):
        return filter(lambda x: x in string.ascii_lowercase, s.lower())

    # using set instead (and contains)
    def test_filter_set(s):
        return filter(letter_set.__contains__, s).lower()

    # Tomalak's solution
    def test_regex(s):
        return re.sub('[^a-z]', '', s.lower())

    # Dana's
    def test_str_join(s):
        return ''.join(c for c in s.lower() if c in string.ascii_lowercase)

    # Modified to use a set.
    def test_str_join_set(s):
        return ''.join(c for c in s.lower() if c in letter_set)

    # Translate approach.
    def test_translate(s):
        return string.translate(s, tab, deletions)

    for test in sorted(globals()):
        if test.startswith("test_"):
            assert globals()[test](s)=='atherrandom'
            print "%30s : %s" % (test, timeit.Timer("f(s)",
                  "from __main__ import %s as f, s" % test).timeit(200000))
This gives me:
    test_filter       : 2.57138351271
    test_filter_set   : 0.981806765698
    test_regex        : 3.10069885233
    test_str_join     : 2.87172979743
    test_str_join_set : 2.43197956381
    test_translate    : 0.335367566218
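[Edit] Updated with the filter solutions as well. (Note that using set.__contains__ makes a big difference here, as it avoids making an extra function call for the lambda.)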
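The unicode caveat above concerns the Python 2 `string.translate` call. As a hedged sketch only, assuming Python 3 (where `str.translate` takes a mapping keyed by code point), a similar effect can be had with a dict subclass whose `__missing__` returns `None`, so that every character not listed in the table is deleted. The names `AsciiLetterTable` and `translate_to_lowercase_letters` are hypothetical and do not come from any of the answers here.

```python
import string


class AsciiLetterTable(dict):
    """Hypothetical helper: a translation table that deletes unlisted characters."""

    def __missing__(self, codepoint):
        # str.translate treats a None result as "delete this character".
        return None


# Map every ASCII letter (upper or lower case) to its lowercase code point.
TABLE = AsciiLetterTable(
    (ord(ch), ord(ch.lower())) for ch in string.ascii_letters
)


def translate_to_lowercase_letters(s):
    """Lowercase ASCII letters and drop every other character."""
    return s.translate(TABLE)


print(translate_to_lowercase_letters("A235th@#$&( er Ra{}|?>ndom"))  # atherrandom
```

Because the table is keyed by code point rather than by byte, accented letters and other non-ASCII characters are simply absent from it and are dropped along with digits and punctuation.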
    >>> filter(str.isalpha, "This is a Test").lower()
    'thisisatest'
    >>> filter(str.isalpha, "A235th@#$&( er Ra{}|?>ndom").lower()
    'atherrandom'
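The session above relies on Python 2, where `filter` applied to a `str` returns a `str`. A minimal sketch of an equivalent for Python 3.7 or later (an assumption; `str.isascii` was added in 3.7) joins the kept characters explicitly and also rejects non-ASCII letters, which `str.isalpha` on its own would accept. The function name `filter_ascii_letters` is hypothetical.

```python
def filter_ascii_letters(s):
    """Keep only ASCII letters and return them in lowercase."""
    # str.isalpha() is True for letters such as 'é', so isascii() is
    # checked as well to restrict the result to A-Z and a-z.
    return "".join(c for c in s if c.isascii() and c.isalpha()).lower()


print(filter_ascii_letters("This is a Test"))              # thisisatest
print(filter_ascii_letters("A235th@#$&( er Ra{}|?>ndom"))  # atherrandom
```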
Not especially runtime efficient, but certainly nicer on poor, tired coder eyes:
    import string

    def strip_string_and_lowercase(s):
        return ''.join(c for c in s.lower() if c in string.ascii_lowercase)
I would:
lowercase the string
replace all [^a-z] with ""
Like this:
    import re

    def strip_string_to_lowercase():
        # Compile the pattern once; the returned lambda reuses it on every call.
        nonascii = re.compile('[^a-z]')
        return lambda s: nonascii.sub('', s.lower().strip())
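EDIT: It turns out that the original version (below) is quite slow, though some performance can be gained by converting it into a closure (above).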
    def strip_string_to_lowercase(s):
        return re.sub('[^a-z]', '', s.lower().strip())
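The closure version compiles the pattern once instead of handing the pattern string to `re.sub` on every call. A minimal sketch of the same idea, assuming a module-level constant is acceptable in place of a closure; the names `NON_LOWER_ASCII` and `strip_to_lowercase_letters` are hypothetical:

```python
import re

# Compiled once at import time; matches any character that is not a-z.
NON_LOWER_ASCII = re.compile(r"[^a-z]")


def strip_to_lowercase_letters(s):
    """Lowercase the string, then delete everything outside a-z."""
    # No strip() is needed: whitespace is removed by the substitution anyway.
    return NON_LOWER_ASCII.sub("", s.lower())


print(strip_to_lowercase_letters("A235th@#$&( er Ra{}|?>ndom"))  # atherrandom
```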
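My performance measurements, for 100,000 iterations against the string

    "A235th@#$&( er Ra{}|?>ndom"

revealed that:

f_re_0 took 2672.000 ms (this is the original version of this answer)
f_re_1 took 2109.000 ms (this is the closure version shown above)
f_re_2 took 2031.000 ms (the closure version, without the redundant strip())
f_fl_1 took 1953.000 ms (unwind's filter/lambda version)
f_fl_2 took 1485.000 ms (Coady's filter version)
f_jn_1 took 1860.000 ms (Dana's join version)

For the sake of the test, I did not print the results.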
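The timing harness itself is not shown in this answer. As a rough sketch only, assuming Python 3 and the standard `timeit` module, measurements of this kind could be taken along the following lines. The candidate functions `f_re` and `f_join` and the helper `time_ms` are hypothetical stand-ins, and absolute numbers will differ by machine and Python version.

```python
import re
import string
import timeit

SAMPLE = "A235th@#$&( er Ra{}|?>ndom"
_NON_LOWER = re.compile("[^a-z]")
_LETTERS = frozenset(string.ascii_lowercase)


def f_re(s):
    """Precompiled-regex candidate."""
    return _NON_LOWER.sub("", s.lower())


def f_join(s):
    """Join-over-generator candidate using a set for membership tests."""
    return "".join(c for c in s.lower() if c in _LETTERS)


def time_ms(func, arg, iterations=100_000):
    """Return the elapsed time in milliseconds for `iterations` calls."""
    # The lambda adds a small, constant per-call overhead to every candidate.
    seconds = timeit.timeit(lambda: func(arg), number=iterations)
    return seconds * 1000.0


if __name__ == "__main__":
    for func in (f_re, f_join):
        # Each call's return value is discarded; only the timing is printed.
        print("%s took %.3f ms" % (func.__name__, time_ms(func, SAMPLE)))
```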
The translate method (Python 2), converting to lowercase and filtering out non-ASCII, non-letter characters:
    from string import ascii_letters, ascii_lowercase, maketrans

    table = maketrans(ascii_letters, ascii_lowercase*2)
    deletechars = ''.join(set(maketrans('','')) - set(ascii_letters))

    print "A235th@#$&( er Ra{}|?>ndom".translate(table, deletechars)
    # -> 'atherrandom'
The translate method (Python 3), first filtering out non-ASCII characters:
ascii_bytes = "A235th@#$&(???? er Ra{}|?>ndom".encode('ascii', 'ignore')
Then use bytes.translate() to convert to lowercase and delete the non-letter bytes:
    from string import ascii_letters, ascii_lowercase

    alpha, lower = [s.encode('ascii') for s in [ascii_letters, ascii_lowercase]]
    table = bytes.maketrans(alpha, lower*2)            # convert to lowercase
    deletebytes = bytes(set(range(256)) - set(alpha))  # delete nonalpha

    print(ascii_bytes.translate(table, deletebytes))
    # -> b'atherrandom'
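Putting the two Python 3 steps together, a minimal sketch of a single helper might look like this. The name `to_lowercase_ascii_letters` and the final decode back to `str` are assumptions added for illustration and are not part of the answer.

```python
from string import ascii_letters, ascii_lowercase

_ALPHA = ascii_letters.encode("ascii")
_LOWER = ascii_lowercase.encode("ascii")
# ascii_letters is lowercase followed by uppercase, so both halves map to a-z.
_TABLE = bytes.maketrans(_ALPHA, _LOWER * 2)
# Every byte value that is not an ASCII letter gets deleted.
_DELETE = bytes(set(range(256)) - set(_ALPHA))


def to_lowercase_ascii_letters(s):
    """Drop non-ASCII characters, then lowercase and keep only letters."""
    ascii_bytes = s.encode("ascii", "ignore")
    return ascii_bytes.translate(_TABLE, _DELETE).decode("ascii")


print(to_lowercase_ascii_letters("A235th@#$&( er Ra{}|?>ndom"))  # atherrandom
```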