xmlgrep

These examples illustrate how to use the igpp.xml/xmlgrep tool, an XML-aware grep that can search for elements (tags) whose names match a pattern and can also search the content of elements for text that matches a pattern. Patterns are written as regular expressions.

XML Parser, XPath generator and search tool.

Parses an XML file and flattens the document content. Values can be retrieved using a simplified XPath that can contain regular expressions for node names. All values in the XML file can be listed with a corresponding XPath.

usage: java igpp.xml.XMLGrep [options] file

-e,--extract {arg}	Extract. Extract all nodes with a given XPath.
-f,--find {arg}	Find. Locate the value associated with an XPath.
-h,--help	Display this text.
-n,--nodes {arg}	Nodes. List all nodes at the given XPath.
-p,--pattern {arg}	Pattern. Search the content of nodes for matches to pattern.
-v,--verbose	Verbose. Show status at each step.
-x,--XML output	XML Output. Output information as well formed XML documents.

Acknowledgements:
Development funded by NASA's PDS project at UCLA.

Brief Introduction to Regular Expressions

A regular expression is a sequence of characters that defines a pattern. Each character is either a metacharacter with a special meaning or a regular character with its literal meaning. Common metacharacters are:

.	A period or dot matches any single character.
?	The question mark indicates there is zero or one of the preceding element. For example, colou?r matches both "color" and "colour".
*	The asterisk indicates there is zero or more of the preceding element. For example, ab*c matches "ac", "abc", "abbc", "abbbc", and so on.
+	The plus sign indicates there is one or more of the preceding element. For example, ab+c matches "abc", "abbc", "abbbc", and so on, but not "ac".
^	Matches the starting position within the string.
$	Matches the ending position within the string.

Other metacharacters include {}[]()| and \. These metacharacters are typically used for advanced patterns. Wikipedia has a full description of these metacharacters.
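The metacharacters listed above can be tried out directly with Java's standard java.util.regex package. The following is a minimal sketch, not part of xmlgrep; the class and method names (RegexBasics, matchesWhole, matchesAnywhere) are made up for illustration. Whether xmlgrep itself requires a pattern to match the whole value or only part of it is not stated here, so both styles are shown.

```java
import java.util.regex.Pattern;

public class RegexBasics {
    // True when the entire input matches the regular expression.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    // True when the regular expression matches somewhere inside the input.
    static boolean matchesAnywhere(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(matchesWhole("a.c", "abc"));          // true: "." stands for "b"
        System.out.println(matchesWhole("colou?r", "color"));    // true: "u" is optional
        System.out.println(matchesWhole("colou?r", "colour"));   // true
        System.out.println(matchesWhole("ab*c", "ac"));          // true: zero "b"s allowed
        System.out.println(matchesWhole("ab+c", "ac"));          // false: at least one "b" needed
        System.out.println(matchesWhole("ab+c", "abbc"));        // true
        System.out.println(matchesAnywhere("^Will", "William")); // true: anchored at the start
        System.out.println(matchesAnywhere("iam$", "William"));  // true: anchored at the end
    }
}
```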

When regular expressions are used to search for tags, the character "/" separates element names in the same way that "/" separates names in a file system path.

Find Elements by Name

If you would like to find an element with a certain name in an XML document, one of the most useful patterns is to prefix the name with ".*/". This finds every occurrence of the element anywhere in the document. For example, if you have a document called "family.xml" like:

<?xml version="1.0" encoding="ISO-8859-1"?>
<doc>
   <name>John</name>
   <sibling>
      <name>Paul</name>
      <child>
         <name>William</name>
      </child>
   </sibling>
</doc>

and you use a pattern like:

.*/name

with a command like:

xmlgrep -f ".*/name" family.xml

you will retrieve three values:

/doc/name: John
/doc/sibling/name: Paul
/doc/sibling/child/name: William

The prefix of each value is the element path (XPath) to the value.
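To make the idea of flattening a document into paths concrete, here is a rough sketch that uses only the standard Java DOM parser. It is not the xmlgrep implementation: the class PathFindSketch and its methods are hypothetical, and it assumes that a pattern must match the entire element path. The family.xml content is inlined (without the XML declaration) so that the example runs on its own.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class PathFindSketch {
    // The family.xml example, inlined so the sketch is self-contained.
    static final String FAMILY_XML =
        "<doc>\n"
        + "   <name>John</name>\n"
        + "   <sibling>\n"
        + "      <name>Paul</name>\n"
        + "      <child>\n"
        + "         <name>William</name>\n"
        + "      </child>\n"
        + "   </sibling>\n"
        + "</doc>\n";

    // Walk the tree depth-first and record {path, value} for every leaf element.
    static void flatten(Element element, String parentPath, List<String[]> out) {
        String path = parentPath + "/" + element.getTagName();
        boolean hasChildElements = false;
        NodeList children = element.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                hasChildElements = true;
                flatten((Element) child, path, out);
            }
        }
        if (!hasChildElements) {
            out.add(new String[] { path, element.getTextContent().trim() });
        }
    }

    // Return "path: value" lines for every leaf whose whole path matches pathRegex.
    static List<String> find(String xml, String pathRegex) {
        Element root;
        try {
            root = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
        List<String[]> flat = new ArrayList<>();
        flatten(root, "", flat);
        Pattern pattern = Pattern.compile(pathRegex);
        List<String> hits = new ArrayList<>();
        for (String[] entry : flat) {
            if (pattern.matcher(entry[0]).matches()) {
                hits.add(entry[0] + ": " + entry[1]);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        for (String line : find(FAMILY_XML, ".*/name")) {
            System.out.println(line);
        }
    }
}
```

Under these assumptions, running the sketch prints the same three lines shown above, in document order.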

Finding Elements by Content

You can also find elements by comparing the text content to a pattern.

Suppose you'd like to find the element containing any word that begins with "Will". Using a pattern like

Will.*

with a command like:

xmlgrep -p "Will.*" family.xml

you will retrieve one value:

/doc/sibling/child/name: William

Combining Patterns

You can combine element name patterns and content patterns to find specific elements that have the desired content. Using a content pattern like

.*a.*

and an element pattern like

.*/name

with a command like:

xmlgrep -f ".*/name" -p ".*a.*" family.xml

you will retrieve two values:

/doc/sibling/name: Paul
/doc/sibling/child/name: William

since both names contain the letter "a". "John" is excluded because it does not.
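The content and combined searches can be pictured as filters over the flattened path and value pairs listed in the "Find Elements by Name" example. The sketch below is an illustration only: CombinedFilterSketch and select are hypothetical names, and it assumes that each pattern must match the whole path or the whole content, which may differ from how xmlgrep applies its patterns.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class CombinedFilterSketch {
    // Flattened {path, value} pairs for family.xml, as listed in the first example.
    static final String[][] FAMILY = {
        { "/doc/name", "John" },
        { "/doc/sibling/name", "Paul" },
        { "/doc/sibling/child/name", "William" },
    };

    // Keep entries whose path and content each match their whole pattern.
    // A null pattern means "do not filter on this field".
    static List<String> select(String[][] entries, String pathRegex, String contentRegex) {
        Pattern path = pathRegex == null ? null : Pattern.compile(pathRegex);
        Pattern content = contentRegex == null ? null : Pattern.compile(contentRegex);
        List<String> hits = new ArrayList<>();
        for (String[] entry : entries) {
            boolean pathOk = path == null || path.matcher(entry[0]).matches();
            boolean contentOk = content == null || content.matcher(entry[1]).matches();
            if (pathOk && contentOk) {
                hits.add(entry[0] + ": " + entry[1]);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Content only, comparable to: xmlgrep -p "Will.*" family.xml
        System.out.println(select(FAMILY, null, "Will.*"));
        // Path and content, comparable to: xmlgrep -f ".*/name" -p ".*a.*" family.xml
        System.out.println(select(FAMILY, ".*/name", ".*a.*"));
    }
}
```

Requiring both filters to pass mirrors the combined command: an entry is reported only when its path matches the element pattern and its text matches the content pattern.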