In other words, how can I tell archive files (jar/rar/etc.) apart from text files (xml/txt, regardless of encoding)?
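There's no guaranteed way, but here are a few possibilities:
1) Look for a header in the file. Unfortunately, headers are format-specific, so while you may be able to tell that something is a RAR file, you won't get the more general answer of whether a file is text or binary.
2) Count character bytes versus non-character bytes. Text files consist mostly of alphabetic characters, while binary files, especially compressed ones such as rar and zip, tend to have all byte values represented more evenly.
3) Look for a regularly repeating pattern of newlines.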
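As a rough sketch of options 1 and 2 in Java: the class and method names below are made up for illustration, and the signature table covers only three common formats (ZIP/JAR, RAR, gzip). The entropy function measures how evenly the byte values are spread; compressed data tends to score close to 8 bits per byte, plain text noticeably lower, but any cut-off value is something you would have to tune yourself.

```java
final class ArchiveSniffer {
    // Leading signatures ("magic numbers") of a few common archive formats.
    private static final byte[] ZIP = {'P', 'K', 3, 4};               // also jar, war, ...
    private static final byte[] RAR = {'R', 'a', 'r', '!', 0x1A, 7};  // shared by RAR 4 and 5
    private static final byte[] GZIP = {0x1F, (byte) 0x8B};

    /** Returns "zip", "rar", "gzip", or null when no known signature matches. */
    static String detectArchive(byte[] head) {
        if (startsWith(head, ZIP)) return "zip";
        if (startsWith(head, RAR)) return "rar";
        if (startsWith(head, GZIP)) return "gzip";
        return null;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }

    /** Shannon entropy in bits per byte: 0 for constant data, 8 for uniformly spread bytes. */
    static double entropy(byte[] data) {
        if (data.length == 0) return 0.0;
        int[] counts = new int[256];
        for (byte b : data) {
            counts[b & 0xFF]++; // mask because Java bytes are signed
        }
        double h = 0.0;
        for (int c : counts) {
            if (c == 0) continue;
            double p = (double) c / data.length;
            h -= p * Math.log(p) / Math.log(2);
        }
        return h;
    }
}
```

Feed both methods the first kilobyte or so of the file. A null result from detectArchive only means that none of the three listed signatures matched, not that the file is text.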
Run file -bi {filename}. If whatever it returns starts with 'text/', the file is non-binary; otherwise it is binary. ;-)
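If you need that check from Java, something along these lines might do. It is only a sketch: FileCommandProbe and its methods are hypothetical names, it assumes the file utility is on the PATH, and the -bi flags are taken as given above (other implementations of file may spell the MIME option differently).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

final class FileCommandProbe {
    /** Interprets one line of `file -bi` output, e.g. "text/plain; charset=us-ascii". */
    static boolean isTextMime(String fileOutput) {
        return fileOutput != null && fileOutput.trim().startsWith("text/");
    }

    /** Runs `file -bi <path>` and reports whether the MIME type starts with "text/". */
    static boolean isTextFile(String path) throws IOException, InterruptedException {
        Process p = new ProcessBuilder("file", "-bi", path)
                .redirectErrorStream(true)
                .start();
        String line;
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(p.getInputStream(), StandardCharsets.UTF_8))) {
            line = r.readLine(); // the first line carries the MIME type
        }
        p.waitFor();
        return isTextMime(line);
    }
}
```

Keeping the string check separate from the process call makes it possible to exercise the decision logic without having file installed.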
I wrote this one. It's a bit simplistic, but it should work fine for Latin-based languages once you tune the ratio.
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;

    /**
     * Guess whether given file is binary. Just checks for anything under 0x09.
     */
    public static boolean isBinaryFile(File f) throws IOException {
        byte[] data = new byte[1024];
        int size;
        try (FileInputStream in = new FileInputStream(f)) {
            // read() returns -1 for an empty file
            size = Math.max(in.read(data), 0);
        }

        int ascii = 0;
        int other = 0;

        for (int i = 0; i < size; i++) {
            // Java bytes are signed; mask so 0x80..0xFF are not seen as negative
            int b = data[i] & 0xFF;
            if (b < 0x09) return true;

            if (b == 0x09 || b == 0x0A || b == 0x0C || b == 0x0D) ascii++;
            else if (b >= 0x20 && b <= 0x7E) ascii++;
            else other++;
        }

        if (other == 0) return false;

        // binary only if more than 95% of the sampled bytes are non-ASCII
        return 100 * other / (ascii + other) > 95;
    }
Take a look at the JMimeMagic library.
jMimeMagic is a Java library for determining the MIME type of files or streams.
Use the Java 7 Files class: http://docs.oracle.com/javase/7/docs/api/java/nio/file/Files.html#probeContentType(java.nio.file.Path)
    import java.io.File;
    import java.io.IOException;
    import java.nio.file.Files;

    boolean isBinaryFile(File f) throws IOException {
        String type = Files.probeContentType(f.toPath());
        if (type == null) {
            // type couldn't be determined, assume binary
            return true;
        } else if (type.startsWith("text")) {
            return false;
        } else {
            // type isn't text
            return true;
        }
    }
I used this code, and it works for English and German text:
    import java.io.File;
    import java.io.FileInputStream;

    private boolean isTextFile(String filePath) throws Exception {
        File f = new File(filePath);
        if (!f.exists())
            return false;
        FileInputStream in = new FileInputStream(f);
        int size = in.available();
        if (size > 1000)
            size = 1000;
        byte[] data = new byte[size];
        in.read(data);
        in.close();
        String s = new String(data, "ISO-8859-1");
        String s2 = s.replaceAll(
                "[a-zA-Z0-9ßöäü\\.\\*!\"§\\$\\%&/()=\\?@~'#:,;\\"
                + "+><\\|\\[\\]\\{\\}\\^°²³\\\\ \\n\\r\\t_\\-`´âêîô"
                + "ÂÊÔÎáéíóàèìòÁÉÍÓÀÈÌÒ©‰¢£¥€±¿»«¼½¾™ª]", "");
        // will delete all text signs

        double d = (double) (s.length() - s2.length()) / (double) (s.length());
        // percentage of text signs in the text
        return d > 0.95;
    }