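How can I remove from a string the characters that MySQL's utf8 character set doesn't support? In other words, the four-byte characters that only MySQL's utf8mb4 character set supports, such as "𝜀" (U+1D700, MATHEMATICAL ITALIC SMALL EPSILON).
For example,
    𝜀C = -2.4‰ ± 0.3‰; 𝜀H = -57‰
should become
    C = -2.4‰ ± 0.3‰; H = -57‰
I want to load a data file into a MySQL table that has CHARSET=utf8.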
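MySQL's utf8mb4 encoding is what the rest of the world calls UTF-8.
MySQL's utf8 encoding is a subset of UTF-8 that supports only the characters in the BMP (U+0000 through U+FFFF inclusive).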
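To see where the "four bytes" come from, here is a minimal sketch that measures how many bytes a character occupies once encoded as UTF-8. The helper name `utf8_byte_length` is made up for illustration; `encode` comes from the core Encode module.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: number of bytes one character needs in UTF-8.
sub utf8_byte_length {
    my ($char) = @_;
    return length(encode('UTF-8', $char));
}

# U+0041 and U+2030 are in the BMP; U+1D700 is outside it.
for my $char ("\x{41}", "\x{2030}", "\x{1D700}") {
    printf "U+%04X needs %d byte(s)\n", ord($char), utf8_byte_length($char);
}
```

BMP characters never need more than three bytes, which is why a character that needs four cannot be stored in a utf8 column.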
Therefore, the following pattern matches the unsupported characters:
    /[^\N{U+0000}-\N{U+FFFF}]/
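Before changing anything, it can help to see which characters the pattern would catch. The following is a rough sketch under that assumption; `unsupported_code_points` is a hypothetical helper name, not part of any module.

```perl
use strict;
use warnings;

# Hypothetical helper: list the code points (as "U+XXXX" strings) of every
# character in $text that lies outside the BMP.
sub unsupported_code_points {
    my ($text) = @_;
    my @chars = $text =~ /([^\N{U+0000}-\N{U+FFFF}])/g;
    return map { sprintf('U+%04X', ord($_)) } @chars;
}

# The per mille sign (U+2030) is in the BMP, so only U+1D700 is reported.
my @found = unsupported_code_points("\N{U+1D700}C = -2.4\N{U+2030}");
print "@found\n";   # prints "U+1D700"
```

The string must already be decoded into characters; run against raw UTF-8 bytes, the pattern would never match.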
You can clean up your input with any of the following three techniques:
1: Remove the unsupported characters:
    s/[^\N{U+0000}-\N{U+FFFF}]//g;
2: Replace the unsupported characters with U+FFFD:
    s/[^\N{U+0000}-\N{U+FFFF}]/\N{REPLACEMENT CHARACTER}/g;
3: Replace the unsupported characters using a translation map:
    my %translations = (
        "\N{MATHEMATICAL ITALIC SMALL EPSILON}" => "\N{GREEK SMALL LETTER EPSILON}",
        # ...
    );
    s{([^\N{U+0000}-\N{U+FFFF}])}{ $translations{$1} // "\N{REPLACEMENT CHARACTER}" }eg;
For example,
    use utf8;                             # Source code is encoded using UTF-8.
    use open ':std', ':encoding(UTF-8)';  # Terminal and files use UTF-8.
    use strict;
    use warnings;
    use 5.010;                            # say, //
    use charnames ':full';                # Not needed in 5.16+
    my %translations = (
        "\N{MATHEMATICAL ITALIC SMALL EPSILON}" => "\N{GREEK SMALL LETTER EPSILON}",
        # ...
    );
    $_ = "𝜀C = -2.4‰ ± 0.3‰; 𝜀H = -57‰";
    say;
    s{([^\N{U+0000}-\N{U+FFFF}])}{ $translations{$1} // "\N{REPLACEMENT CHARACTER}" }eg;
    say;
Output:
    𝜀C = -2.4‰ ± 0.3‰; 𝜀H = -57‰
    εC = -2.4‰ ± 0.3‰; εH = -57‰
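Since the goal is to load a data file, the same substitution could be applied line by line while copying the file. This is only a sketch: `sanitize_stream` is a hypothetical helper, both handles are assumed to carry an `:encoding(UTF-8)` layer, and the demo uses in-memory handles where real code would open the data file.

```perl
use strict;
use warnings;

# Hypothetical helper: copy lines from $in to $out, replacing every
# character outside the BMP with U+FFFD (REPLACEMENT CHARACTER).
sub sanitize_stream {
    my ($in, $out) = @_;
    while (my $line = <$in>) {
        $line =~ s/[^\N{U+0000}-\N{U+FFFF}]/\x{FFFD}/g;
        print {$out} $line;
    }
    return;
}

# Demo on in-memory handles. $input holds raw UTF-8 bytes:
# F0 9D 9C 80 is the encoding of U+1D700.
my $input  = "\xF0\x9D\x9C\x80C = -2.4\n";
my $output = '';
open(my $in,  '<:encoding(UTF-8)', \$input)  or die "open input: $!";
open(my $out, '>:encoding(UTF-8)', \$output) or die "open output: $!";
sanitize_stream($in, $out);
close($out) or die "close output: $!";
# $output now holds the UTF-8 bytes of U+FFFD (EF BF BD) followed by "C = -2.4\n".
```

Swapping the substitution for the deleting or translating variant changes the policy without touching the loop.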