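I'd like examples in a language I'm familiar with (Java, C#, Ruby, PHP, C/C++), although examples in any language or pseudocode are more than welcome.
What is the best way to split a large XML document into smaller sections that are still valid XML? For my purposes I need to split them into roughly thirds or fourths, but for the sake of examples, splitting them into n components would be good.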
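Parsing XML documents with DOM doesn't scale, because the whole document has to be held in memory.
This Groovy script uses StAX (Streaming API for XML) to split an XML document between its top-level elements (those that share the same QName as the first child of the root element). It's pretty fast, handles arbitrarily large documents, and is very useful when you want to split a large batch file into smaller pieces.
It requires Groovy running on Java 6 or later (which bundles StAX); on an older JVM, add the StAX API and an implementation such as Woodstox to the CLASSPATH.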
    import javax.xml.stream.*

    pieces = 5                      // number of output files to aim for
    input = "input.xml"
    output = "output_%04d.xml"      // String.format pattern for the output file names
    eventFactory = XMLEventFactory.newInstance()
    fileNumber = elementCount = 0

    // Opens the input and reads up to the first child of the root, keeping the
    // StartDocument, root and first-child events in script-level variables.
    def createEventReader() {
        reader = XMLInputFactory.newInstance().createXMLEventReader(new FileInputStream(input))
        start = reader.next()
        root = reader.nextTag()
        firstChild = reader.nextTag()
        return reader
    }

    // Starts the next output file and writes the document start and root element to it.
    def createNextEventWriter() {
        println "Writing to '${filename = String.format(output, ++fileNumber)}'"
        writer = XMLOutputFactory.newInstance().createXMLEventWriter(new FileOutputStream(filename), start.characterEncodingScheme)
        writer.add(start)
        writer.add(root)
        return writer
    }

    // Pass 1: count the start tags after the first child that share its name.
    elements = createEventReader().findAll { it.startElement && it.name == firstChild.name }.size()
    println "Splitting ${elements} <${firstChild.name.localPart}> elements into ${pieces} pieces"
    chunkSize = elements / pieces

    // Pass 2: copy the events, switching to a new file whenever the current chunk is full.
    writer = createNextEventWriter()
    writer.add(firstChild)
    createEventReader().each {
        if (it.startElement && it.name == firstChild.name) {
            if (++elementCount > chunkSize) {
                // Ending the document also closes the still-open root element.
                writer.add(eventFactory.createEndDocument())
                writer.flush()
                writer = createNextEventWriter()
                elementCount = 0
            }
        }
        writer.add(it)
    }
    writer.flush()
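Since the question mentions Java first, here is a rough Java sketch of the same two-pass StAX idea. It is not a line-by-line port of the script above, and it makes several assumptions of its own: the class `XmlSplitter` and its methods `split` and `countChildren` are hypothetical names, it only splits between direct children of the root, it always writes UTF-8, it drops anything outside the root element (comments, DOCTYPE), and it spreads the elements as evenly as it can instead of using a fixed chunk size.

```java
import javax.xml.namespace.QName;
import javax.xml.stream.*;
import javax.xml.stream.events.*;
import java.io.*;
import java.nio.file.*;
import java.util.*;

// Hypothetical helper; none of these names come from the Groovy script.
public class XmlSplitter {
    private final XMLEventFactory events = XMLEventFactory.newInstance();
    private final XMLOutputFactory outputs = XMLOutputFactory.newInstance();
    private final List<Path> written = new ArrayList<>();
    private final String pattern;   // String.format pattern, e.g. "output_%04d.xml"
    private StartElement root;      // re-emitted at the top of every chunk
    private OutputStream out;
    private XMLEventWriter writer;

    private XmlSplitter(String pattern) { this.pattern = pattern; }

    /** Splits input into at most `pieces` well-formed files and returns their paths. */
    public static List<Path> split(Path input, int pieces, String pattern)
            throws IOException, XMLStreamException {
        XmlSplitter splitter = new XmlSplitter(pattern);
        splitter.copy(input, pieces, countChildren(input));
        return splitter.written;
    }

    /** Pass 1: counts the root's direct children named like its first child. */
    public static long countChildren(Path input) throws IOException, XMLStreamException {
        try (InputStream in = Files.newInputStream(input)) {
            XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(in);
            QName childName = null;
            long total = 0;
            int depth = 0;
            while (reader.hasNext()) {
                XMLEvent e = reader.nextEvent();
                if (e.isStartElement()) {
                    QName name = e.asStartElement().getName();
                    if (depth == 1 && childName == null) childName = name;
                    if (depth == 1 && name.equals(childName)) total++;
                    depth++;
                }
                if (e.isEndElement()) depth--;
            }
            return total;
        }
    }

    /** Pass 2: streams the input again, rotating output files at child boundaries. */
    private void copy(Path input, int pieces, long total) throws IOException, XMLStreamException {
        long n = Math.max(1, Math.min(pieces, total));  // never produce an empty chunk
        long base = total / n, extra = total % n;       // first `extra` chunks get one more
        try (InputStream in = Files.newInputStream(input)) {
            XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(in);
            QName childName = null;
            long inChunk = 0;
            int depth = 0;
            while (reader.hasNext()) {
                XMLEvent e = reader.nextEvent();
                if (e.isStartElement() && depth == 0) {  // root start tag opens the first chunk
                    root = e.asStartElement();
                    openChunk();
                    depth++;
                    continue;
                }
                if (e.isStartElement()) {
                    QName name = e.asStartElement().getName();
                    if (depth == 1 && childName == null) childName = name;
                    if (depth == 1 && name.equals(childName)) {
                        long limit = base + (written.size() <= extra ? 1 : 0);
                        if (inChunk == limit) { closeChunk(); openChunk(); inChunk = 0; }
                        inChunk++;
                    }
                    depth++;
                }
                if (e.isEndElement()) depth--;
                if (depth > 0) writer.add(e);  // skips prolog, root end tag and epilog
            }
            closeChunk();
        }
    }

    private void openChunk() throws IOException, XMLStreamException {
        Path path = Paths.get(String.format(pattern, written.size() + 1));
        out = Files.newOutputStream(path);
        writer = outputs.createXMLEventWriter(out, "UTF-8");
        writer.add(events.createStartDocument("UTF-8", "1.0"));
        writer.add(root);
        written.add(path);
    }

    private void closeChunk() throws IOException, XMLStreamException {
        QName name = root.getName();
        writer.add(events.createEndElement(name.getPrefix(), name.getNamespaceURI(), name.getLocalPart()));
        writer.add(events.createEndDocument());
        writer.flush();
        writer.close();
        out.close();
    }
}
```

Under these assumptions, a call such as `XmlSplitter.split(Paths.get("input.xml"), 5, "output_%04d.xml")` would write files named `output_0001.xml`, `output_0002.xml` and so on. Tracking the element depth is a deliberate difference from the script: it keeps a nested element that happens to share the child's name from being treated as a split point.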