These examples illustrate how to use the igpp.xml/xmlgrep tool, an XML-aware grep that can search for elements (tags) whose names match a pattern and can also search the content of elements for text that matches a pattern. Patterns are written in regular expression syntax.
XML Parser, XPath generator and search tool.
Parses an XML file and flattens the document content. Values can be retrieved using a simplified XPath that can contain regular expressions for node names. All values in the XML file can be listed with a corresponding XPath.
usage: java igpp.xml.XMLGrep [options] file
-e,--extract {arg} | Extract. Extract all nodes with a given XPath. |
-f,--find {arg} | Find. Locate the value associated with an XPath. |
-h,--help | Display this text. |
-n,--nodes {arg} | Nodes. List all nodes at the given XPath. |
-p,--pattern {arg} | Pattern. Search the content of nodes for matches to pattern. |
-v,--verbose | Verbose. Show status at each step. |
-x,--XML output | XML Output. Output information as well formed XML documents. |
Acknowledgements:
Development funded by NASA's PDS project at UCLA.
A regular expression is a sequence of characters that defines a pattern. Each character is understood either as a metacharacter with a special meaning or as a regular character with its literal meaning. Common metacharacters are:
. | A period or dot matches any single character. |
? | The question mark indicates there is zero or one of the preceding element. For example, colou?r matches both "color" and "colour". |
* | The asterisk indicates there is zero or more of the preceding element. For example, ab*c matches "ac", "abc", "abbc", "abbbc", and so on. |
+ | The plus sign indicates there is one or more of the preceding element. For example, ab+c matches "abc", "abbc", "abbbc", and so on, but not "ac". |
^ | Matches the starting position within the string. |
$ | Matches the ending position within the string. |
Other metacharacters include {}[]()| and \. These metacharacters are typically used for advanced patterns. Wikipedia has a full description of these metacharacters.
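The metacharacters listed above can be tried out in any regular expression engine. The sketch below uses Python's `re` module purely for illustration. XMLGrep is a Java tool and its engine may differ in details, although these basic metacharacters behave the same way in most engines. The `cases` list and the variable names are illustrative and not part of the tool.

```python
import re

# Each entry: (pattern, text, whether the whole text is expected to match).
cases = [
    ("a.c", "abc", True),         # . matches any single character
    ("colou?r", "color", True),   # ? allows zero or one "u"
    ("colou?r", "colour", True),
    ("ab*c", "ac", True),         # * allows zero or more "b"
    ("ab*c", "abbbc", True),
    ("ab+c", "abc", True),        # + requires at least one "b"
    ("ab+c", "ac", False),
]

# fullmatch() succeeds only if the pattern covers the entire text.
results = [(p, t, re.fullmatch(p, t) is not None) for p, t, _ in cases]
for pattern, text, matched in results:
    print(f"{pattern!r} against {text!r}: {matched}")

# ^ and $ anchor a pattern to the start and end of the text.
starts_with_will = re.search("^Will", "William") is not None
ends_with_iam = re.search("iam$", "William") is not None
```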
When regular expressions are used to search for tags, the metacharacter "/" delineates element names in the same way that a "/" delineates names in a file system path.
To find an element with a certain name anywhere in an XML document, one of the most useful patterns is to prefix the name with ".*/". This matches every occurrence of the element, no matter where it sits in the document. For example, suppose you have a document called "family.xml" like:
<?xml version="1.0" encoding="ISO-8859-1"?>
<doc>
   <name>John</name>
   <sibling>
      <name>Paul</name>
      <child>
         <name>William</name>
      </child>
   </sibling>
</doc>
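To make the idea of a ".*/" prefix concrete, here is a minimal Python sketch that flattens the family document into slash-separated paths and keeps the paths that match a name pattern. It is an illustration of the concept, not XMLGrep's implementation: the path format (a leading "/" followed by element names), the whole-path matching, and the helper names `flatten` and `find_by_name` are all assumptions. The XML declaration is omitted from the embedded string for simplicity.

```python
import re
import xml.etree.ElementTree as ET

FAMILY_XML = """<doc>
  <name>John</name>
  <sibling>
    <name>Paul</name>
    <child>
      <name>William</name>
    </child>
  </sibling>
</doc>"""

def flatten(element, prefix=""):
    """Yield (path, text) for every element, walking the tree depth-first."""
    path = f"{prefix}/{element.tag}"
    yield path, (element.text or "").strip()
    for child in element:
        yield from flatten(child, path)

def find_by_name(root, pattern):
    """Return the (path, text) pairs whose whole path matches the pattern."""
    regex = re.compile(pattern)
    return [(p, t) for p, t in flatten(root) if regex.fullmatch(p)]

root = ET.fromstring(FAMILY_XML)
matches = find_by_name(root, ".*/name")
for path, text in matches:
    print(path, "=", text)
```

Under these assumptions, ".*/name" picks up all three name elements, while a narrower pattern such as ".*/sibling/name" keeps only the one directly under sibling.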
You can also find elements by comparing the text content to a pattern.
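As a rough illustration of content matching, the Python sketch below tests the text of each element in the family document against a pattern. It is not the tool's own code: the helper name `find_by_content`, the choice to strip surrounding whitespace, and the use of a substring search rather than a whole-text match are assumptions made for the example.

```python
import re
import xml.etree.ElementTree as ET

FAMILY_XML = """<doc>
  <name>John</name>
  <sibling>
    <name>Paul</name>
    <child>
      <name>William</name>
    </child>
  </sibling>
</doc>"""

def find_by_content(root, pattern):
    """Return (tag, text) for each element whose own text matches the pattern."""
    regex = re.compile(pattern)
    hits = []
    for element in root.iter():
        # Ignore the whitespace-only text of container elements.
        text = (element.text or "").strip()
        if text and regex.search(text):
            hits.append((element.tag, text))
    return hits

root = ET.fromstring(FAMILY_XML)
starts_with_p = find_by_content(root, "^P")
print(starts_with_p)
```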
Suppose you'd like to find the elements containing any words that begin with "Will". Using the pattern like
You can combine element name patterns and content patterns to find specific elements that have the desired content. Using the content pattern like
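The idea of combining the two kinds of pattern can be sketched in Python as a filter that requires both a path match and a content match. This is a hypothetical illustration and not XMLGrep's implementation: the helpers `walk` and `grep`, the path format, and the example patterns ".*/name" and "^[JP]" are chosen only for this sketch.

```python
import re
import xml.etree.ElementTree as ET

FAMILY_XML = """<doc>
  <name>John</name>
  <sibling>
    <name>Paul</name>
    <child>
      <name>William</name>
    </child>
  </sibling>
</doc>"""

def walk(element, prefix=""):
    """Yield (path, text) for every element, depth-first."""
    path = f"{prefix}/{element.tag}"
    yield path, (element.text or "").strip()
    for child in element:
        yield from walk(child, path)

def grep(root, name_pattern, content_pattern):
    """Keep elements whose path matches fully and whose text contains a match."""
    name_re = re.compile(name_pattern)
    content_re = re.compile(content_pattern)
    return [
        (path, text)
        for path, text in walk(root)
        if name_re.fullmatch(path) and content_re.search(text)
    ]

root = ET.fromstring(FAMILY_XML)
hits = grep(root, ".*/name", "^[JP]")
for path, text in hits:
    print(path, "=", text)
```

Narrowing either pattern narrows the result: with the name pattern ".*/child/name" and the same content pattern, nothing is returned, because the only name under child does not begin with "J" or "P".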