0

I have content in the XML node <content type='html'> that uses the entities &lt; and &gt; in place of < and >, e.g. in input.xml:

<?xml version='1.0' encoding='utf-8'?> <feed xmlns='http://www.w3.org/2005/Atom' xmlns:blogger='http://schemas.google.com/blogger/2018'> <id>tag:blogger.com,1999:blog-189623866</id> <title>TestBlog</title> <entry> <id>tag:blogger.com,1999:blog-189623866.post-4683409</id> <blogger:type>POST</blogger:type> <blogger:status>LIVE</blogger:status> <author> <name>Author</name> <blogger:type>BLOGGER</blogger:type> </author> <title>Test Post</title> <content type='html'>Lorem &lt;p&gt;Lorem ipsum dolor sit amet.&lt;/p&gt;More Text....</content> 

I need to find/replace these entities with < and > so I get this in output.xml:

<?xml version='1.0' encoding='utf-8'?>.... <content type='html'>Lorem <p>Lorem ipsum dolor sit amet.</p>More Text....</content> 

Trying this

 xmlstarlet edit -N ns="http://www.w3.org/2005/Atom" \ --update "//ns:content" --expr "replace(.,"&lt;","<")" input.xml > output.xml 

throws the shell error zsh: command not found: lt

So I think I need to somehow escape the &lt; in the shell or in xmlstarlet.

Is that possible?

10
  • There are nested double quotes "replace(.,"&lt;","<")". Single quotes should fix that 'replace(.,"&lt;","<")' Commented Sep 17 at 16:05
  • Thanks! That was it. But now I get xmlXPathCompOpEval: function replace not found, Unregistered function. I'm using xmlstarlet --version shows 1.6.1, and these cryptic details: "compiled against libxml2 2.9.13, linked with 20913 compiled against libxslt 1.1.35, linked with 10135" Commented Sep 17 at 16:23
  • You should probably use an xslt approach Commented Sep 17 at 16:37
  • Looks like xmlstarlet edit’s xpath arguments do not support XSLT functions. from martin7th.github.io/xmlstarlet-notes/#error-messages Commented Sep 17 at 18:22
  • All in all that's escaped html within xml. Unescaping it could cause other issues due to broken html. Commented Sep 17 at 18:26
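The "broken html" concern in the last comment can be checked before attempting any unescaping. Below is a minimal Python sketch, not taken from the thread, that tests whether a piece of markup is well-formed enough to be parsed as an XML fragment. The helper name `is_wellformed_fragment` and the `wrapper` element are made up for illustration; only the standard library's `xml.etree.ElementTree` is assumed.

```python
import xml.etree.ElementTree as ET


def is_wellformed_fragment(markup: str) -> bool:
    """Return True if markup parses as XML once wrapped in a dummy root element."""
    try:
        # A fragment may have several top-level nodes, so wrap it in one root.
        ET.fromstring(f"<wrapper>{markup}</wrapper>")
        return True
    except ET.ParseError:
        return False


# The sample from the question is well-formed.
print(is_wellformed_fragment("Lorem <p>Lorem ipsum dolor sit amet.</p>More Text...."))  # True

# Typical HTML with an unclosed <br> is not well-formed XML.
print(is_wellformed_fragment("<p>first line<br>second line</p>"))  # False
```

Content that fails this check would presumably need an HTML parser rather than an XML one before it can be turned into element nodes.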

2 Answers

1

To refine the answer by Michael Kay, as it doesn't take namespaces into account, here is a complete XSLT 3.0 stylesheet

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="3.0"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xpath-default-namespace="http://www.w3.org/2005/Atom"
    exclude-result-prefixes="#all"
    expand-text="yes">

  <xsl:mode on-no-match="shallow-skip"/>

  <xsl:template match="content[@type = 'html']">
    <xsl:element name="{local-name()}">
      <xsl:apply-templates select="@*"/>
      <xsl:sequence select="parse-xml-fragment(.)"/>
    </xsl:element>
  </xsl:template>

  <xsl:template match="content[@type = 'html']/@*">
    <xsl:attribute name="{local-name()}" select="."/>
  </xsl:template>

</xsl:stylesheet>

that, when run with e.g. Saxon 12 on the input sample

<?xml version='1.0' encoding='utf-8'?> <feed xmlns='http://www.w3.org/2005/Atom' xmlns:blogger='http://schemas.google.com/blogger/2018'> <id>tag:blogger.com,1999:blog-189623866</id> <title>TestBlog</title> <entry> <id>tag:blogger.com,1999:blog-189623866.post-4683409</id> <blogger:type>POST</blogger:type> <blogger:status>LIVE</blogger:status> <author> <name>Author</name> <blogger:type>BLOGGER</blogger:type> </author> <title>Test Post</title> <content type='html'>Lorem &lt;p&gt;Lorem ipsum dolor sit amet.&lt;/p&gt;More Text....</content> </entry> </feed> 

returns the result

<?xml version="1.0" encoding="UTF-8"?><content type="html">Lorem <p>Lorem ipsum dolor sit amet.</p>More Text....</content> 

Online fiddle with Saxon 12 HE run in the browser.


2 Comments

Really appreciate your time :) I didn't realize that xmlstarlet was quite old. I will see if I can learn how to use XSLT 3.0 and Saxon from your complete example. So is HTML always escaped when it is within an XML node? Or only in my example case? Maybe it is escaped to allow the parser at Blogger to import it?
Thanks again! I figured out how to install Java and run Saxon 12.9 on my local machine via the shell and was able to process XML files using a stylesheet. I will be asking more questions :)
1

The replace() function requires XPath 2.0, which, although it was published in 2007, is not yet implemented in xmlstarlet.

However, that's not your real problem. xmlstarlet, like other XPath tools, works on the tree representation of a document, not on the surface lexical form. If you insert a < character into the tree representation of a document, then in the lexical serialization it will be turned back into &lt;. What you are really trying to do is to turn part of a text node into an element node, which can't be done just by modifying characters.
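The tree-versus-lexical distinction can be seen in any XML library. Here is a small Python sketch using the standard library's `xml.etree.ElementTree` (a stand-in chosen for illustration, not one of the tools discussed here): once parsed, the text node already contains real < and > characters, and the serializer escapes them again on output.

```python
import xml.etree.ElementTree as ET

source = (
    "<content type='html'>"
    "Lorem &lt;p&gt;Lorem ipsum dolor sit amet.&lt;/p&gt;More Text...."
    "</content>"
)
content = ET.fromstring(source)

# In the tree, the entities are already decoded: the text node holds literal
# '<' and '>' characters, so there is no "&lt;" left to search and replace.
print(content.text)  # Lorem <p>Lorem ipsum dolor sit amet.</p>More Text....
print(len(content))  # 0 (no child elements, only text)

# On serialization the characters are escaped again, because a literal '<'
# inside a text node would otherwise not be well-formed XML.
serialized = ET.tostring(content, encoding="unicode")
print(serialized)
```

So a character-level replace has nothing to act on in the tree. The text has to be parsed into element nodes instead.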

You might be able to solve the problem using xmlstarlet's capabilities for manipulating the lexical XML (escape and unescape) as distinct from its XPath capabilities -- but I'm not sufficiently familiar with it. Personally I would tackle this using XSLT 3.0:

<xsl:template match="content[@type='html']/text()"> <xsl:copy-of select="parse-xml-fragment(.)"/> </xsl:template> 
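For comparison, here is the same idea as a rough Python sketch, offered as an alternative illustration rather than as what either answer proposes. It parses the escaped text as an XML fragment and grafts the resulting nodes onto the `content` element. The function name `unescape_html_content` and the `wrapper` element are hypothetical. Like `parse-xml-fragment`, it assumes the escaped HTML is well-formed XML and will raise `ParseError` otherwise.

```python
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"


def unescape_html_content(feed_xml: str) -> str:
    """Turn escaped markup inside Atom <content type='html'> into child elements."""
    root = ET.fromstring(feed_xml)
    # Clark notation ({namespace}localname) matches elements in the Atom namespace.
    for content in root.iter(f"{{{ATOM}}}content"):
        if content.get("type") != "html" or not content.text:
            continue
        # Parse the decoded text as a fragment by wrapping it in a dummy root.
        wrapper = ET.fromstring(f"<wrapper>{content.text}</wrapper>")
        # Leading text stays as text; parsed elements (with their tails) become children.
        content.text = wrapper.text
        content.extend(list(wrapper))
    return ET.tostring(root, encoding="unicode")


feed = (
    "<feed xmlns='http://www.w3.org/2005/Atom'><entry>"
    "<content type='html'>Lorem &lt;p&gt;Lorem ipsum dolor sit amet.&lt;/p&gt;More Text....</content>"
    "</entry></feed>"
)
print(unescape_html_content(feed))
```

ElementTree writes the Atom elements with a generated prefix such as `ns0:` unless a prefix is registered. The grafted `p` element stays in no namespace.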

2 Comments

Thanks for your answer :) I didn't realize that xmlstarlet was quite old. I'm going to see if I can learn how to use XSLT 3.0. So is HTML always escaped when it is within an XML node? Or only in this case?
It depends on the author and the format, normally HTML (as HTML 5 or HTML 4 or these days simply HTML) served as text/html doesn't meet the XML well-formedness rules and therefore needs to be escaped. But on the other hand both answers use parse-xml-fragment, meaning, they employ an XML (fragment) parsing function, so the assumption is the content is parseable as (an) XML fragment. That might not work with all stuff being HTML. There is fn:parse-html, but only in commercial editions of Saxon, as XPath 4 is work in progress. Earlier commercial editions of Saxon have `saxon:par
