Apparently Java's regex flavor counts umlauts and other special characters as non-"word characters" when I use a regex.
"TESTÜTEST".replaceAll( "\\W", "" )
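returns "TESTTEST" for me. What I want is for only the truly non-"word characters" to be removed. Is there any way to do this without resorting to something like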
"[^A-Za-z0-9äöüÄÖÜßéèáàúùóò]"
only to realize that I forgot ô?
Use [^\p{L}\p{Nd}]+
This matches every (Unicode) character that is neither a letter nor a (decimal) digit.
In Java:
String resultString = subjectString.replaceAll("[^\\p{L}\\p{Nd}]+", "");
Edit:
I changed \p{N} to \p{Nd} because the former also matches some number symbols like ¼; the latter doesn't. See it on regex101.com.
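To make the difference between \p{Nd} and \p{N} concrete, here is a minimal sketch. The class and method names (WordCharDemo, stripNonWord, stripNonWordAnyNumber) are made up for illustration, and the input strings are written with Unicode escapes only so the file compiles regardless of source encoding.

```java
public class WordCharDemo {

    // Removes every character that is neither a Unicode letter (\p{L})
    // nor a decimal digit (\p{Nd}).
    static String stripNonWord(String s) {
        return s.replaceAll("[^\\p{L}\\p{Nd}]+", "");
    }

    // Same idea with \p{N}, which also keeps "other number" symbols
    // such as the vulgar fraction one quarter (U+00BC).
    static String stripNonWordAnyNumber(String s) {
        return s.replaceAll("[^\\p{L}\\p{N}]+", "");
    }

    public static void main(String[] args) {
        // "TEST" + U with diaeresis (U+00DC) + "TEST"
        String umlaut = "TEST\u00DCTEST";
        // 'a', '-', '1', ' ', one quarter (U+00BC), '_', o with circumflex (U+00F4), '!'
        String mixed = "a-1 \u00BC_\u00F4!";

        System.out.println(stripNonWord(umlaut));        // the umlaut survives
        System.out.println(stripNonWord(mixed));         // the fraction is removed
        System.out.println(stripNonWordAnyNumber(mixed)); // the fraction is kept
    }
}
```

Note that, unlike \w, this character class also removes the underscore, since it is neither a letter nor a digit.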
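I was trying to achieve the exact opposite when I came across this thread. I know it's quite old, but here is my solution anyway. You can use Unicode blocks. In this case, compile the following code (with the right imports):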
String s = "äêìóblah";
Pattern p = Pattern.compile("[\\p{InLatin-1Supplement}]+"); // this regex uses a block
Matcher m = p.matcher(s);
System.out.println(m.find());
System.out.println(s.replaceAll(p.pattern(), "#"));
You should see the following output:
true
#blah
Best,
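One thing worth keeping in mind with the block approach: the Latin-1 Supplement block (U+0080 to U+00FF) also contains non-letters such as the copyright sign and the multiplication and division signs. The following is a rough sketch of how the block could be narrowed to letters only with a character class intersection; the class and method names (BlockDemo, maskBlock, maskBlockLetters) are hypothetical, and the test string uses Unicode escapes to stay independent of the source encoding.

```java
public class BlockDemo {

    // Replaces each run of characters from the Latin-1 Supplement block with '#'.
    static String maskBlock(String s) {
        return s.replaceAll("[\\p{InLatin-1Supplement}]+", "#");
    }

    // Replaces only runs of *letters* from that block, by intersecting
    // the block with \p{L} inside the character class.
    static String maskBlockLetters(String s) {
        return s.replaceAll("[\\p{InLatin-1Supplement}&&[\\p{L}]]+", "#");
    }

    public static void main(String[] args) {
        // copyright sign, a-diaeresis, e-circumflex, multiplication sign,
        // "blah", division sign
        String s = "\u00A9\u00E4\u00EA\u00D7blah\u00F7";

        System.out.println(maskBlock(s));        // symbols are masked too
        System.out.println(maskBlockLetters(s)); // only the accented letters are masked
    }
}
```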
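Sometimes you don't want to simply remove the characters, but only strip the accents. I came up with the following utility class, which I use in my Java REST web projects whenever I need to include a String in a URL: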
import java.text.Normalizer;
import java.text.Normalizer.Form;

import org.apache.commons.lang.StringUtils;

/**
 * Utility class for String manipulation.
 *
 * @author Stefan Haberl
 */
public abstract class TextUtils {

    private static String[] searchList = { "Ä", "ä", "Ö", "ö", "Ü", "ü", "ß" };
    private static String[] replaceList = { "Ae", "ae", "Oe", "oe", "Ue", "ue", "sz" };

    /**
     * Normalizes a String by removing all accents to original 127 US-ASCII
     * characters. This method handles German umlauts and "sharp-s" correctly
     *
     * @param s
     *            The String to normalize
     * @return The normalized String
     */
    public static String normalize(String s) {
        if (s == null)
            return null;

        String n = null;

        n = StringUtils.replaceEachRepeatedly(s, searchList, replaceList);
        n = Normalizer.normalize(n, Form.NFD).replaceAll("[^\\p{ASCII}]", "");

        return n;
    }

    /**
     * Returns a clean representation of a String which might be used safely
     * within an URL. Slugs are a more human friendly form of URL encoding a
     * String.
     *
     * The method first normalizes a String, then converts it to lowercase and
     * removes ASCII characters, which might be problematic in URLs:
     *
As a German speaker, I've also included proper handling of German umlauts - the list should be easy to extend for other languages.
HTH
EDIT: Note that including the returned String in a URL may not be safe. You should at least HTML-encode it to prevent XSS attacks.
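For readers who don't have Commons Lang on the classpath, the normalize step above can be approximated with the JDK alone. This is only a sketch of the same idea, not the utility class itself: the names AccentDemo and toAscii are made up, and plain String.replace calls stand in for StringUtils.replaceEachRepeatedly. The umlauts are written as Unicode escapes so the file compiles regardless of source encoding.

```java
import java.text.Normalizer;

public class AccentDemo {

    // Transliterates German umlauts and sharp s first, then strips the
    // remaining accents by decomposing (NFD) and dropping everything
    // that is not ASCII.
    static String toAscii(String s) {
        if (s == null) {
            return null;
        }
        String n = s
                .replace("\u00C4", "Ae").replace("\u00E4", "ae")   // A/a with diaeresis
                .replace("\u00D6", "Oe").replace("\u00F6", "oe")   // O/o with diaeresis
                .replace("\u00DC", "Ue").replace("\u00FC", "ue")   // U/u with diaeresis
                .replace("\u00DF", "sz");                          // sharp s
        // NFD splits e.g. e-acute into a plain 'e' plus a combining accent,
        // so removing non-ASCII characters leaves the base letter behind.
        return Normalizer.normalize(n, Normalizer.Form.NFD)
                .replaceAll("[^\\p{ASCII}]", "");
    }

    public static void main(String[] args) {
        // "Gr" + o-diaeresis + sharp s + "e caf" + e-acute
        System.out.println(toAscii("Gr\u00F6\u00DFe caf\u00E9"));
    }
}
```

The umlaut replacements have to run before the NFD step; otherwise ö would be decomposed and reduced to a plain o instead of oe.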