Seems like this should be easy, but …
I'm trying to use XSLT to extract part of an XML file as plain text, throwing away the rest.
So from sample input like this …
<?xml version="1.0" encoding="UTF-8"?>
<?oxygen RNGSchema="http://segonku.unl.edu/teianalytics/TEIAnalytics.rng"
type="xml"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0" n="Wright2-0034.sgml.xml">
<teiHeader type="text">
<fileDesc>
<titleStmt>
<title>Header Title</title>
</titleStmt>
<publicationStmt>
<p>Published</p>
</publicationStmt>
<sourceDesc>
<p>Sourced</p>
</sourceDesc>
</fileDesc>
</teiHeader>
<text>
<front>
</front>
<body>
<head>THE TITLE</head>
<div type="chapter" part="N" org="uniform" sample="complete">
<head>CHAPTER I</head>
<p>Some text.</p>
</div>
</body>
</text>
</TEI>
… I'm trying to get just the text contained within the <body> tags and all their children. The desired output in this case is:
THE TITLE
CHAPTER I
Some text.
Potential complication: <body> can also exist in the <front> matter and/or in the <teiHeader>, so what I really need is the children of <body> if and only if that tag is a child of <text>, which is itself a child of <TEI>.
I've tried really simple XSL like this …
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output method="text"/>
<xsl:template match="/TEI/text/body">
<xsl:apply-templates select="."/>
</xsl:template>
</xsl:stylesheet>
… but it gives me plain text of everything in the file, not just the <body> elements.
Thanks!
Best Answer
The reason is a well-known feature of XPath (and the cause of many thousands of similar questions): any unprefixed name is treated as belonging to "no namespace". However, every element in the provided XML document belongs to the namespace "http://www.tei-c.org/ns/1.0" and must be addressed as a node in that namespace. Because the template with match="/TEI/text/body" therefore never matches, the built-in template rules process the whole document and output all of its text.
Solution: Define the document's default namespace in the XSLT code (this time with a prefix bound to it) and use that prefix on every element name in the XPath expressions and match patterns.
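To see the difference the prefix makes, here is a small diagnostic sketch. It assumes the prefix tei (an arbitrary name chosen for illustration; any prefix works as long as it is bound to the TEI namespace URI) and simply counts what each path selects:

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:tei="http://www.tei-c.org/ns/1.0">
  <!-- "tei" is an assumed prefix name; only the namespace URI matters -->
  <xsl:output method="text"/>

  <xsl:template match="/">
    <!-- Unprefixed names are looked up in no namespace -->
    <xsl:text>unprefixed: </xsl:text>
    <xsl:value-of select="count(/TEI/text/body)"/>
    <!-- Prefixed names are looked up in the TEI namespace -->
    <xsl:text>&#10;prefixed: </xsl:text>
    <xsl:value-of select="count(/tei:TEI/tei:text/tei:body)"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```

Run against the sample document from the question, this should report 0 for the unprefixed path and 1 for the prefixed one.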
This is one of the simplest and shortest possible transformations that produces the wanted result:
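A minimal sketch of such a transformation, assuming the prefix tei (an arbitrary name, not taken from the original) is bound to the TEI namespace:

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:tei="http://www.tei-c.org/ns/1.0">
  <!-- "tei" is an assumed prefix name; only the namespace URI matters -->
  <xsl:output method="text"/>
  <!-- Drop whitespace-only text nodes that come from the source indentation -->
  <xsl:strip-space elements="*"/>

  <!-- Suppress every text node by default ... -->
  <xsl:template match="text()"/>

  <!-- ... except those under a body that is a child of text, itself a child of the top TEI element -->
  <xsl:template match="/tei:TEI/tei:text/tei:body//text()">
    <xsl:value-of select="."/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```

The second pattern is more specific than the bare text() pattern, so it has the higher default priority and wins for text inside the wanted <body>. A <body> inside <front> or <teiHeader> does not match the path, so its text is suppressed.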
When applied to the provided XML document, the wanted, correct result is produced:
THE TITLE
CHAPTER I
Some text.