
Reading all text elements with StAX


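I need to parse an XML file, whatever tags it contains, and read the text of all its leaves (elements that contain only text). I'm using StAX, but there seems to be no way to know in advance that an element contains only text (so getElementText throws an exception whenever the element is not a leaf). So I decided to use a filter that keeps only the tag events, and to iterate through the document this way: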

InputStream in = null;
    try {
        in = new FileInputStream("file.xml");
        DatiEstratti de = DatiEstratti.getInstance();

        // Event-based processing
        XMLInputFactory factory = (XMLInputFactory) XMLInputFactory.newInstance();

        XMLEventReader eventReader = factory.createXMLEventReader(in);
        // use the filter to keep only the element tags
        eventReader = factory.createFilteredReader(eventReader, new ElementOnlyFilter());

        while (eventReader.hasNext()) {

            XMLEvent event = eventReader.nextEvent();

            if (event.getEventType() == XMLStreamConstants.START_ELEMENT) {
                StartElement startElement = event.asStartElement();

                XMLEvent peekEvent = eventReader.peek();
                if(peekEvent.isEndElement()){
                    // this is the first time a pop is about to happen,
                    // so this is a leaf.
                    // retrieve the data.
                    String value = eventReader.getElementText();

                    logger.info("dato : " + value);
                }


                String nome = startElement.getName().getLocalPart();
                String prefix = startElement.getName().getPrefix();
                if (prefix != null) {
                    nome = prefix + ":" + nome;
                }
                de.push(nome);
                logger.info("push : " + de.stampaPercorso());



            } else if ((event.getEventType() == XMLStreamConstants.END_ELEMENT)) {

                de.pop();
                logger.info("pop : " + de.stampaPercorso());
                if (0 > de.nLivelliPercorso()) {
                    break;
                }
            }
            //handle more event types here...
        }

...where the filter is:

public class ElementOnlyFilter implements EventFilter, StreamFilter {

/* implementation of EventFilter interface */
@Override
public boolean accept(XMLEvent event) {
    return acceptInternal(event.getEventType(  ));
}

/* implementation of StreamFilter interface */
@Override
public boolean accept(XMLStreamReader reader) {
    return acceptInternal(reader.getEventType(  ));
}

/* internal utility method */
private boolean acceptInternal(int eventType) {
    return eventType == XMLStreamConstants.START_ELEMENT
            || eventType == XMLStreamConstants.END_ELEMENT;
}

}

The problem is that when a leaf is found, I get the following exception:

javax.xml.stream.XMLStreamException: ParseError at [row,col]:[3,42]
Message: parser must be on START_ELEMENT to read next text
    at com.sun.xml.internal.stream.XMLEventReaderImpl.getElementText(XMLEventReaderImpl.java:114)
    at javax.xml.stream.util.EventReaderDelegate.getElementText(EventReaderDelegate.java:88)
    at xmlparser.XmlParser.main(XmlParser.java:63)

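I have no clue. Is something wrong with this code? I thought peek() does not change the reader, so getElementText() should still be called on the start element. Is there another way to achieve my goal?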

1 Answer

  • 4

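    First, if you filter the stream down to start and end element events only, you will never see the text contained in the leaf nodes. I would take a different approach and use an unfiltered stream, like this: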

    XMLEventReader eventReader = factory.createXMLEventReader(in);
    StringBuilder content = null;
    while(eventReader.hasNext()) {
      XMLEvent event = eventReader.nextEvent();
      if(event.isStartElement()) {
        // other start element processing here
        content = new StringBuilder();
      } else if(event.isEndElement()) {
        if(content != null) {
          // this was a leaf element
          String leafText = content.toString();
          // do something with the leaf node
        } else {
          // not a leaf
        }
        // in all cases, discard content
        content = null;
      } else if(event.isCharacters()) {
        if(content != null) {
          content.append(event.asCharacters().getData());
        }
      }
      // other event types here
    }
    

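    The trick is the content = null at the end of the end-element branch: if content is non-null when you enter the if(event.isEndElement()) block, you know there was no intervening end element event between this end tag and its corresponding start tag, i.e. it is a leaf node.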
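
For anyone who wants to try this out, here is a minimal runnable sketch of the same idea. The class name LeafTextSketch, the method leafTexts and the sample XML are invented for illustration; it assumes the document is passed as a String and that wrapping parse errors in an unchecked exception is acceptable.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper, not part of StAX: collects the text of every leaf element.
final class LeafTextSketch {

    // Returns the text of each leaf element in document order.
    static List<String> leafTexts(String xml) {
        List<String> leaves = new ArrayList<>();
        // Non-null only while no child element has ended inside the current element.
        StringBuilder content = null;
        try {
            XMLEventReader reader = XMLInputFactory.newInstance()
                    .createXMLEventReader(new StringReader(xml));
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (event.isStartElement()) {
                    // Every start tag opens a fresh buffer.
                    content = new StringBuilder();
                } else if (event.isCharacters()) {
                    if (content != null) {
                        content.append(event.asCharacters().getData());
                    }
                } else if (event.isEndElement()) {
                    if (content != null) {
                        // No element ended since the last start tag: this is a leaf.
                        leaves.add(content.toString());
                    }
                    // In all cases, discard the buffer.
                    content = null;
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            // Assumption for this sketch: surface parse errors as unchecked.
            throw new IllegalStateException("could not parse XML", e);
        }
        return leaves;
    }

    public static void main(String[] args) {
        String xml = "<root><a>one</a><b><c>two</c><d/></b></root>";
        System.out.println(leafTexts(xml));
    }
}
```

Note that an empty element such as `<d/>` counts as a leaf with empty text, and whitespace between sibling elements is dropped because the buffer is null at that point. If empty leaves are not wanted, check the string before adding it to the list.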
