It seems like there should be a simpler way than:
import string
s = "string. With. Punctuation?"  # Sample string
out = s.translate(string.maketrans("",""), string.punctuation)
Is there?
From an efficiency perspective, you're not going to beat (Python 2):
s.translate(None, string.punctuation)
On Python 3, where str.translate() takes a single table argument, use:
s.translate(str.maketrans('', '', string.punctuation))
It performs raw string operations in C with a lookup table; short of writing your own C code, not much will beat that.
If speed isn't a concern, another option is:
exclude = set(string.punctuation)
s = ''.join(ch for ch in s if ch not in exclude)
This is faster than calling s.replace for each character, but it won't perform as well as non-pure-Python approaches such as regexes or string.translate, as the timing script below shows when run. For this type of problem, doing the work at as low a level as possible pays off.
Timing code (Python 2):
import re, string, timeit

s = "string. With. Punctuation"
exclude = set(string.punctuation)
table = string.maketrans("","")
regex = re.compile('[%s]' % re.escape(string.punctuation))

def test_set(s):
    return ''.join(ch for ch in s if ch not in exclude)

def test_re(s):  # From Vinko's solution, with fix.
    return regex.sub('', s)

def test_trans(s):
    return s.translate(table, string.punctuation)

def test_repl(s):  # From S.Lott's solution
    for c in string.punctuation:
        s=s.replace(c,"")
    return s

print "sets      :",timeit.Timer('f(s)', 'from __main__ import s,test_set as f').timeit(1000000)
print "regex     :",timeit.Timer('f(s)', 'from __main__ import s,test_re as f').timeit(1000000)
print "translate :",timeit.Timer('f(s)', 'from __main__ import s,test_trans as f').timeit(1000000)
print "replace   :",timeit.Timer('f(s)', 'from __main__ import s,test_repl as f').timeit(1000000)
Each printed figure is the total time in seconds for 1,000,000 calls, so lower is better.
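As a quick check of the two approaches described in this answer, here is a minimal Python 3 sketch. It assumes the sample string from the question; the variable names via_translate and via_set are made up for illustration.

```python
import string

s = "string. With. Punctuation?"

# Build the deletion table once; the third argument of str.maketrans
# lists the characters that translate() should delete.
table = str.maketrans("", "", string.punctuation)
via_translate = s.translate(table)

# Pure-Python alternative: keep only characters that are not in the set.
exclude = set(string.punctuation)
via_set = "".join(ch for ch in s if ch not in exclude)

print(via_translate)  # string With Punctuation
print(via_set == via_translate)  # True
```

Both produce the same string; they differ only in how much of the work happens in C.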
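Regular expressions are simple enough, if you know them.
import re
s = "string. With. Punctuation?"
s = re.sub(r'[^\w\s]','',s)
In the code above, re.sub replaces every character that is NOT a word character (\w) or whitespace (\s) with an empty string.
So the . and ? characters are gone from s after it has been run through the regex.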
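One detail worth seeing in action: \w also matches the underscore, and in Python 3 it matches non-ASCII letters for str patterns, so those characters survive. A small sketch, where the helper name strip_non_word is made up for illustration:

```python
import re

def strip_non_word(text):
    # Delete everything that is neither a word character nor whitespace.
    return re.sub(r"[^\w\s]", "", text)

print(strip_non_word("string. With. Punctuation?"))  # string With Punctuation
print(strip_non_word("snake_case, really!"))         # snake_case really
```

If underscores should go too, the pattern would need an explicit `_` added, for example `[^\w\s]|_`.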
For convenience, here is a summary of how to strip punctuation from a string in both Python 2 and Python 3. See the other answers for detailed explanations.
Python 2
import string
s = "string. With. Punctuation?"
table = string.maketrans("","")
new_s = s.translate(table, string.punctuation)  # Output: string without punctuation
Python 3
import string
s = "string. With. Punctuation?"
table = str.maketrans(dict.fromkeys(string.punctuation))  # OR {key: None for key in string.punctuation}
new_s = s.translate(table)  # Output: string without punctuation
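The Python 3 spellings of the table that appear in this thread should be interchangeable. The sketch below (variable names are illustrative) builds the table three ways and compares the results; str.maketrans converts single-character keys to their code points, so all three end up as the same dict.

```python
import string

# Three ways to build a "delete all ASCII punctuation" table in Python 3.
table_args = str.maketrans("", "", string.punctuation)
table_fromkeys = str.maketrans(dict.fromkeys(string.punctuation))
table_dictcomp = str.maketrans({key: None for key in string.punctuation})

print(table_args == table_fromkeys == table_dictcomp)  # True

s = "string. With. Punctuation?"
print(s.translate(table_args))  # string With Punctuation
```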
myString.translate(None, string.punctuation)
I usually use something like this:
>>> s = "string. With. Punctuation?" # Sample string
>>> import string
>>> for c in string.punctuation:
...     s= s.replace(c,"")
...
>>> s
'string With Punctuation'
string.punctuation is ASCII only! A more correct (but also much slower) way is to use the unicodedata module (Python 2):
# -*- coding: utf-8 -*-
from unicodedata import category
s = u'String — with - «punctation »...'
s = ''.join(ch for ch in s if category(ch)[0] != 'P')
print 'stripped', s
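For Python 3 the same idea might look like the sketch below. The function name strip_unicode_punct is an assumption for illustration, and the sample text uses escape sequences so that it does not depend on the file encoding.

```python
import unicodedata

def strip_unicode_punct(text):
    # Every punctuation category (Pc, Pd, Ps, Pe, Pi, Pf, Po) starts with "P".
    return "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))

# An em dash, a hyphen, guillemets, and ASCII periods.
sample = "String \u2014 with - \u00abpunctuation\u00bb..."
print(repr(strip_unicode_punct(sample)))  # 'String  with  punctuation'

# Symbols such as "$" are category Sc, not punctuation, so they are kept.
print(strip_unicode_punct("$5.00!"))  # $500
```

Note that the spaces around the removed dashes are left in place, so runs of whitespace may need collapsing afterwards.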
Not necessarily simpler, but a different way, if you are more familiar with the re family.
import re, string
s = "string. With. Punctuation?"  # Sample string
out = re.sub('[%s]' % re.escape(string.punctuation), '', s)
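For Python 3 str or Python 2 unicode values, str.translate() takes only a dictionary; code points (integers) are looked up in that mapping, and anything mapped to None is removed.
To remove (some?) punctuation, then, use:
import string
remove_punct_map = dict.fromkeys(map(ord, string.punctuation))
s.translate(remove_punct_map)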
The dict.fromkeys() class method makes it trivial to create the mapping: it takes the sequence of keys and sets every value to None.
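To make the mapping semantics concrete, here is a small Python 3 sketch. The custom_map dictionary is a made-up example showing that a code point can map to None (delete) or to a replacement string.

```python
import string

# Keys are code points (integers); a value of None means "delete".
remove_punct_map = dict.fromkeys(map(ord, string.punctuation))
print(remove_punct_map[ord("!")])  # None
print("string. With. Punctuation?".translate(remove_punct_map))  # string With Punctuation

# A hand-written mapping can mix deletions and replacements.
custom_map = {ord("-"): " ", ord("!"): None}
print("well-known fact!".translate(custom_map))  # well known fact
```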
To remove all punctuation, not just ASCII punctuation, your table needs to be a little bigger; see J.F. Sebastian's answer (Python 3 version):
import unicodedata
import sys
remove_punct_map = dict.fromkeys(i for i in range(sys.maxunicode + 1)
                                 if unicodedata.category(chr(i)).startswith('P'))
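A usage sketch for such a table, assuming Python 3. The name punct_table and the Spanish sample text are illustrative. Building the table scans every code point, so it is worth doing once at import time and reusing.

```python
import sys
import unicodedata

# Built once: every code point whose Unicode category starts with "P".
punct_table = dict.fromkeys(
    cp for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)).startswith("P")
)

# Guillemets, a comma, and inverted/regular question marks are all removed.
text = "\u00abHola\u00bb, \u00bfqu\u00e9 tal?"
print(text.translate(punct_table))  # Hola qué tal
```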
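string.punctuation misses loads of punctuation marks that are commonly used in the real world. How about a solution that works for non-ASCII punctuation?
import regex
s = u"string. With. Some?Really Weird?Non?ASCII? ??Punctuation???"
remove = regex.compile(ur'[\p{C}|\p{M}|\p{P}|\p{S}|\p{Z}]+', regex.UNICODE)
remove.sub(u" ", s).strip()
Personally, I believe this is the best way to remove punctuation from a string in Python because:
It removes all Unicode punctuation.
It's easily modifiable, e.g. you can drop \p{S} from the pattern if you want to remove punctuation but keep symbols like $.
You can get really specific about what you want to keep and what you want to remove; for example, \p{Pd} will only remove dashes.
This regex also normalizes whitespace. It maps tabs, carriage returns, and other oddities to nice, single spaces.
This uses Unicode character properties, which you can read more about on Wikipedia.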
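The regex module used above is a third-party package. If it is not available, a rough standard-library approximation of the same behaviour could look like this Python 3 sketch. It swaps the \p{...} character classes for unicodedata.category lookups; the names REMOVE_MAJOR_CLASSES and normalize are made up for illustration.

```python
import unicodedata

# Major Unicode classes to drop: Control/other, Mark, Punctuation, Symbol, Separator.
REMOVE_MAJOR_CLASSES = frozenset("CMPSZ")

def normalize(text):
    # Replace every unwanted character with a space...
    spaced = "".join(
        " " if unicodedata.category(ch)[0] in REMOVE_MAJOR_CLASSES else ch
        for ch in text
    )
    # ...then collapse runs of spaces and trim the ends.
    return " ".join(spaced.split())

print(normalize("Hello,\tworld!  $5"))  # Hello world 5
```

Removing a letter from REMOVE_MAJOR_CLASSES has the same effect as dropping the matching \p{...} class from the pattern; without "S", for instance, the $ sign is kept.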
Here's a one-liner for Python 3.5:
import string
"l*ots! o(f. p@u)n[c}t]u[a'ti\"on#$^?/".translate(str.maketrans({a:None for a in string.punctuation}))
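I haven't seen this answer yet. Just use a regex; it removes every character that is not a word character (\w), a digit (\d), or a whitespace character (\s). (\w already covers digits, so \d is redundant but harmless.)
import re
s = "string. With. Punctuation?"  # Sample string
out = re.sub(r'[^\w\d\s]+', '', s)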
This might not be the best solution, but this is how I did it.
import string
f = lambda x: ''.join([i for i in x if i not in string.punctuation])
Here's a function I wrote. It's not very efficient, but it is simple, and you can add or remove any punctuation you want:
def stripPunc(wordList):
    """Strips punctuation from list of words"""
    puncList = [".",";",":","!","?","/","\\",",","#","@","$","&",")","(","\""]
    for punc in puncList:
        # One pass over the whole list per punctuation mark.
        wordList = [word.replace(punc,'') for word in wordList]
    return wordList
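If efficiency starts to matter, the same configurable idea can be expressed with a translation table. This is only a Python 3 sketch; the name strip_punc and its punctuation parameter are assumptions, not part of the answer above.

```python
def strip_punc(words, punctuation='.;:!?/\\,#@$&)("'):
    """Strip the given punctuation characters from each word in a list."""
    # One table, one pass per word, instead of one pass per punctuation mark.
    table = str.maketrans("", "", punctuation)
    return [word.translate(table) for word in words]

print(strip_punc(["Hello,", "world!", "(really?)"]))  # ['Hello', 'world', 'really']
print(strip_punc(["a-b", "c.d"], punctuation="-"))    # ['ab', 'c.d']
```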
import re
s = "string. With. Punctuation?"  # Sample string
out = re.sub(r'[^a-zA-Z0-9\s]', '', s)
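As an update, I rewrote @Brian's example in Python 3 and changed it to move the regex compile step inside the function. My thinking was to time every step needed to make the function work. Perhaps you are using distributed computing and can't share the regex object between your workers, so the re.compile step has to run in each worker. I was also curious to time two different ways of building the table with Python 3's maketrans:
table = str.maketrans({key: None for key in string.punctuation})
versus
table = str.maketrans('', '', string.punctuation)
In addition, I added another set-based method that uses set intersection to reduce the number of iterations.
This is the complete code: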
import re, string, timeit

s = "string. With. Punctuation"

def test_set(s):
    exclude = set(string.punctuation)
    return ''.join(ch for ch in s if ch not in exclude)

def test_set2(s):
    _punctuation = set(string.punctuation)
    for punct in set(s).intersection(_punctuation):
        s = s.replace(punct, ' ')
    return ' '.join(s.split())

def test_re(s):  # From Vinko's solution, with fix.
    regex = re.compile('[%s]' % re.escape(string.punctuation))
    return regex.sub('', s)

def test_trans(s):
    table = str.maketrans({key: None for key in string.punctuation})
    return s.translate(table)

def test_trans2(s):
    table = str.maketrans('', '', string.punctuation)
    return s.translate(table)

def test_repl(s):  # From S.Lott's solution
    for c in string.punctuation:
        s=s.replace(c,"")
    return s

print("sets       :",timeit.Timer('f(s)', 'from __main__ import s,test_set as f').timeit(1000000))
print("sets2      :",timeit.Timer('f(s)', 'from __main__ import s,test_set2 as f').timeit(1000000))
print("regex      :",timeit.Timer('f(s)', 'from __main__ import s,test_re as f').timeit(1000000))
print("translate  :",timeit.Timer('f(s)', 'from __main__ import s,test_trans as f').timeit(1000000))
print("translate2 :",timeit.Timer('f(s)', 'from __main__ import s,test_trans2 as f').timeit(1000000))
print("replace    :",timeit.Timer('f(s)', 'from __main__ import s,test_repl as f').timeit(1000000))
These are my results (total seconds for 1,000,000 calls):
sets       : 3.1830138750374317
sets2      : 2.189873124472797
regex      : 7.142953420989215
translate  : 4.243278483860195
translate2 : 2.427158243022859
replace    : 4.579746678471565
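One caveat when reading these numbers: test_set2 does not do quite the same job as the other functions. It replaces punctuation with spaces and then collapses whitespace, so its output can differ. The sketch below mirrors the two set-based functions under made-up names (strip_join, strip_via_spaces) to show the difference.

```python
import string

def strip_join(s):
    # Same logic as test_set: delete punctuation, leave everything else alone.
    exclude = set(string.punctuation)
    return "".join(ch for ch in s if ch not in exclude)

def strip_via_spaces(s):
    # Same logic as test_set2: punctuation becomes a space, then whitespace is collapsed.
    for punct in set(s).intersection(string.punctuation):
        s = s.replace(punct, " ")
    return " ".join(s.split())

sample = "well-known  fact."
print(repr(strip_join(sample)))        # 'wellknown  fact'
print(repr(strip_via_spaces(sample)))  # 'well known fact'
```

On the benchmark string the two happen to agree, which is why the difference does not show up in the timing script.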