Xml – Extract part of an XML file as plain text using XSLT


Seems like this should be easy, but …

I'm trying to use XSLT to extract part of an XML file as plain text, throwing away the rest.

So from sample input like this …

<?xml version="1.0" encoding="UTF-8"?>
<?oxygen RNGSchema="http://segonku.unl.edu/teianalytics/TEIAnalytics.rng"
                        type="xml"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0" n="Wright2-0034.sgml.xml">
   <teiHeader type="text">
      <fileDesc>
         <titleStmt>
            <title>Header Title</title>
         </titleStmt>
         <publicationStmt>
            <p>Published</p>
         </publicationStmt>
         <sourceDesc>
            <p>Sourced</p>
         </sourceDesc>
      </fileDesc>
   </teiHeader>
   <text>
      <front>
      </front>
      <body>
         <head>THE TITLE</head>
         <div type="chapter" part="N" org="uniform" sample="complete">
            <head>CHAPTER I</head>
            <p>Some text.</p>
         </div>
      </body>
   </text>
</TEI>

… I'm trying to get just the text contained within the <body> tags and all their children. The desired output in this case is:

THE TITLE
CHAPTER I
Some text.

Potential complication: <body> can also exist in the <front> matter and/or in the <teiHeader>, so what I really need is the children of <body> if and only if that tag is a child of <text> and of <TEI>.

I've tried really simple XSL like this …

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
    <xsl:output method="text"/>
    <xsl:template match="/TEI/text/body">
        <xsl:apply-templates select="."/>
    </xsl:template>
</xsl:stylesheet>

… but it gives me plain text of everything in the file, not just the <body> elements.

Thanks!

Best Answer

I've tried really simple XSL like this ...

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
    <xsl:output method="text"/>
    <xsl:template match="/TEI/text/body">
        <xsl:apply-templates select="."/>
    </xsl:template>
</xsl:stylesheet>

... but it gives me plain text of everything in the file, not just the <body> elements.

The reason is a well-known feature of XPath, and the source of a great many similar questions: any unprefixed name in an expression is treated as belonging to "no namespace". Every element in the provided XML document, however, belongs to the namespace "http://www.tei-c.org/ns/1.0" and must be addressed as a node in that namespace. The pattern /TEI/text/body therefore matches nothing, and the template is never applied.
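
To see why the whole file still comes out as text, it may help to spell out the built-in template rules that an XSLT 1.0 processor falls back on when no template in the stylesheet matches a node. The sketch below writes those rules out explicitly. It should behave like a stylesheet with no templates at all, which is in effect what the stylesheet in the question amounts to for this namespaced document.

```xml
<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output method="text"/>

 <!-- Built-in rule for the root and for elements: recurse into the children -->
 <xsl:template match="*|/">
  <xsl:apply-templates/>
 </xsl:template>

 <!-- Built-in rule for text and attribute nodes: copy the string value to the output -->
 <xsl:template match="text()|@*">
  <xsl:value-of select="."/>
 </xsl:template>

 <!-- Built-in rule for comments and processing instructions: output nothing -->
 <xsl:template match="processing-instruction()|comment()"/>
</xsl:stylesheet>
```

Since the recursion reaches every element, every text node in the header and the body ends up in the output.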

Solution: Declare the document's default namespace in the XSLT code (this time with a prefix bound to it) and use that prefix on every element name.

This is one of the simplest and shortest possible transformations that produces the wanted result:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 xmlns:x="http://www.tei-c.org/ns/1.0">
 <xsl:output method="text"/>
 <xsl:strip-space elements="*"/>

 <xsl:template match="x:text/x:body//text()">
  <xsl:value-of select="concat(.,'&#xA;')"/>
 </xsl:template>
 <xsl:template match="text()"/>
</xsl:stylesheet>

When applied to the provided XML document:

<TEI xmlns="http://www.tei-c.org/ns/1.0" n="Wright2-0034.sgml.xml">
    <teiHeader type="text">
        <fileDesc>
            <titleStmt>
                <title>Header Title</title>
            </titleStmt>
            <publicationStmt>
                <p>Published</p>
            </publicationStmt>
            <sourceDesc>
                <p>Sourced</p>
            </sourceDesc>
        </fileDesc>
    </teiHeader>
    <text>
        <front>      </front>
        <body>
            <head>THE TITLE</head>
            <div type="chapter" part="N" org="uniform" sample="complete">
                <head>CHAPTER I</head>
                <p>Some text.</p>
            </div>
        </body>
    </text>
</TEI>

the wanted, correct result is produced:

THE TITLE
CHAPTER I
Some text.
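
The pattern above accepts any body that is a child of a text element, wherever that text element sits. If the body must be a child of text and text a child of the top-level TEI element, as the question asks, one possible variant is to start at the document node and select along the full path. This is only a sketch under that assumption. The prefix x is arbitrary, and the selection is made from the root rather than with apply-templates select="." inside a template matching the body, because that would keep re-applying the same template to the same node.

```xml
<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 xmlns:x="http://www.tei-c.org/ns/1.0">
 <xsl:output method="text"/>
 <xsl:strip-space elements="*"/>

 <!-- Start at the document node and select only text under /TEI/text/body -->
 <xsl:template match="/">
  <xsl:apply-templates select="x:TEI/x:text/x:body//text()"/>
 </xsl:template>

 <!-- Emit each selected text node followed by a newline -->
 <xsl:template match="text()">
  <xsl:value-of select="concat(., '&#xA;')"/>
 </xsl:template>
</xsl:stylesheet>
```

On the sample document this should give the same three lines, while a body nested under front or teiHeader would not be selected.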

Related Solutions

XML to CSV using XSLT help

This simple transformation produces the wanted result:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output method="text"/>
 <xsl:strip-space elements="*"/>

 <xsl:template match="/">
  <xsl:apply-templates select="//text()"/>
 </xsl:template>

 <xsl:template match="text()">
  <xsl:copy-of select="."/>
  <xsl:if test="not(position()=last())">,</xsl:if>
 </xsl:template>
</xsl:stylesheet>

Do note the use of:

 <xsl:strip-space elements="*"/>

to discard any white-space-only text nodes.

Update: AJ raised the problem that the results should be grouped in records/tuples, one per line. The question doesn't define what exactly a record/tuple should be. Therefore the current solution solves the two problems of white-space-only text nodes and of missing commas, but does not aim to group the output into records/tuples.

Google-chrome – Chrome, Firefox and Safari not applying XSLT? IE does!

You need to ensure your page is served with the correct HTTP Content-Type header value, in this case text/xml. In PHP this can be done with the header function:

header('Content-type: text/xml');
echo $xmlStr;

*thanks to meder who led me in the right direction for this.

Also, in Chrome and Safari an error still occurs while applying the XSLT because of the following doctype-public value:

<xsl:output 
method='xml' 
indent='yes'
doctype-public='"-//W3C//DTD XHTML Basic 1.1//EN"
"http://www.w3.org/TR/xhtml-basic/xhtml-basic11.dtd"'/>;

It should be:

<xsl:output 
  method="xml"
  indent="yes"
  doctype-public="-//W3C//DTD XHTML Basic 1.1//EN"
  doctype-system="http://www.w3.org/TR/xhtml-basic/xhtml-basic11.dtd"/>

According to the spec, the doctype-public attribute should not even be looked at if doctype-system is not specified.

*thanks to LarsH for pointing out doctype-system should be a separate attribute.
