PHP “pretty print” HTML (not Tidy)

formathtmlPHPtidy

I'm using the DOM extension in PHP to build some HTML documents, and I want the output to be formatted nicely (with new lines and indentation) so that it's readable, however, from the many tests I've done:

"formatOutput = true" doesn't work at all with saveHTML(), only saveXML()
Even if I used saveXML(), it still only works on elements created via the DOM, not elements that are included with loadHTML(), even with "preserveWhiteSpace = false"

If anyone knows differently I'd really like to know how they got it to work.

So, I have a DOM document, and I'm using saveHTML() to output the HTML. As it's coming from the DOM I know it is valid, there's no need to "Tidy" or validate it in any way.

I'm simply looking for a way to get nicely formatted output from the output I receive from the DOM extension.

NB. As you may have guessed, I don't want to use the Tidy extension as a) it does a lot more that I need it too (the markup is already valid) and b) it actually makes changes to the HTML content (such as the HTML 5 doctype and some elements).

Follow Up:

OK, with the help of the answer below I've worked out why the DOM extension wasn't working. Although the given example works, it still wasn't working with my code. With the help of this comment I found that if you have any text nodes where isWhitespaceInElementContent() is true no formatting will be applied beyond that point. This happens regardless of whether or not preserveWhiteSpace is false. The solution is to remove all of these nodes (although I'm not sure if this may have adverse effects on the actual content).

Best Answer

you're right, there seems to be no indentation for HTML (others are also confused). XML works, even with loaded code.

<?php
function tidyHTML($buffer) {
    // load our document into a DOM object
    $dom = new DOMDocument();
    // we want nice output
    $dom->preserveWhiteSpace = false;
    $dom->loadHTML($buffer);
    $dom->formatOutput = true;
    return($dom->saveHTML());
}

// start output buffering, using our nice
// callback function to format the output.
ob_start("tidyHTML");

?>
<html>
    <head>
    <title>foo bar</title><meta name="bar" value="foo"><body><h1>bar foo</h1><p>It's like comparing apples to oranges.</p></body></html>
<?php
// this will be called implicitly, but we'll
// call it manually to illustrate the point.
ob_end_flush();
?>

result:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head>
<title>foo bar</title>
<meta name="bar" value="foo">
</head>
<body>
<h1>bar foo</h1>
<p>It's like comparing apples to oranges.</p>
</body>
</html>

the same with saveXML() ...

<?xml version="1.0" standalone="yes"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
  <head>
    <title>foo bar</title>
    <meta name="bar" value="foo"/>
  </head>
  <body>
    <h1>bar foo</h1>
    <p>It's like comparing apples to oranges.</p>
  </body>
</html>

probably forgot to set preserveWhiteSpace=false before loadHTML?

disclaimer: i stole most of the demo code from tyson clugg/php manual comments. lazy me.

UPDATE: i now remember some years ago i tried the same thing and ran into the same problem. i fixed this by applying a dirty workaround (wasn't performance critical): i just somehow converted around between SimpleXML and DOM until the problem vanished. i suppose the conversion got rid of those nodes. maybe load with dom, import with simplexml_import_dom, then output the string, parse this with DOM again and then printed it pretty. as far as i remember this worked (but it was really slow).

Correctly setting up the connection

Note that when using PDO to access a MySQL database real prepared statements are not used by default. To fix this you have to disable the emulation of prepared statements. An example of creating a connection using PDO is:

$dbConnection = new PDO('mysql:dbname=dbtest;host=127.0.0.1;charset=utf8', 'user', 'password');

$dbConnection->setAttribute(PDO::ATTR_EMULATE_PREPARES, false);
$dbConnection->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

In the above example the error mode isn't strictly necessary, but it is advised to add it. This way the script will not stop with a Fatal Error when something goes wrong. And it gives the developer the chance to catch any error(s) which are thrown as PDOExceptions.

What is mandatory, however, is the first setAttribute() line, which tells PDO to disable emulated prepared statements and use real prepared statements. This makes sure the statement and the values aren't parsed by PHP before sending it to the MySQL server (giving a possible attacker no chance to inject malicious SQL).

Although you can set the charset in the options of the constructor, it's important to note that 'older' versions of PHP (before 5.3.6) silently ignored the charset parameter in the DSN.

Explanation

The SQL statement you pass to prepare is parsed and compiled by the database server. By specifying parameters (either a ? or a named parameter like :name in the example above) you tell the database engine where you want to filter on. Then when you call execute, the prepared statement is combined with the parameter values you specify.

The important thing here is that the parameter values are combined with the compiled statement, not an SQL string. SQL injection works by tricking the script into including malicious strings when it creates SQL to send to the database. So by sending the actual SQL separately from the parameters, you limit the risk of ending up with something you didn't intend.

Any parameters you send when using a prepared statement will just be treated as strings (although the database engine may do some optimization so parameters may end up as numbers too, of course). In the example above, if the $name variable contains 'Sarah'; DELETE FROM employees the result would simply be a search for the string "'Sarah'; DELETE FROM employees", and you will not end up with an empty table.

Another benefit of using prepared statements is that if you execute the same statement many times in the same session it will only be parsed and compiled once, giving you some speed gains.

Oh, and since you asked about how to do it for an insert, here's an example (using PDO):

$preparedStatement = $db->prepare('INSERT INTO table (column) VALUES (:column)');

$preparedStatement->execute([ 'column' => $unsafeValue ]);

Can prepared statements be used for dynamic queries?

While you can still use prepared statements for the query parameters, the structure of the dynamic query itself cannot be parametrized and certain query features cannot be parametrized.

For these specific scenarios, the best thing to do is use a whitelist filter that restricts the possible values.

// Value whitelist
// $dir can only be 'DESC', otherwise it will be 'ASC'
if (empty($dir) || $dir !== 'DESC') {
   $dir = 'ASC';
}

Html – What are valid values for the id attribute in HTML

For HTML 4, the answer is technically:

ID and NAME tokens must begin with a letter ([A-Za-z]) and may be followed by any number of letters, digits ([0-9]), hyphens ("-"), underscores ("_"), colons (":"), and periods (".").

HTML 5 is even more permissive, saying only that an id must contain at least one character and may not contain any space characters.

The id attribute is case sensitive in XHTML.

As a purely practical matter, you may want to avoid certain characters. Periods, colons and '#' have special meaning in CSS selectors, so you will have to escape those characters using a backslash in CSS or a double backslash in a selector string passed to jQuery. Think about how often you will have to escape a character in your stylesheets or code before you go crazy with periods and colons in ids.

For example, the HTML declaration <div id="first.name"></div> is valid. You can select that element in CSS as #first\.name and in jQuery like so: $('#first\\.name'). But if you forget the backslash, $('#first.name'), you will have a perfectly valid selector looking for an element with id first and also having class name. This is a bug that is easy to overlook. You might be happier in the long run choosing the id first-name (a hyphen rather than a period), instead.

You can simplify your development tasks by strictly sticking to a naming convention. For example, if you limit yourself entirely to lower-case characters and always separate words with either hyphens or underscores (but not both, pick one and never use the other), then you have an easy-to-remember pattern. You will never wonder "was it firstName or FirstName?" because you will always know that you should type first_name. Prefer camel case? Then limit yourself to that, no hyphens or underscores, and always, consistently use either upper-case or lower-case for the first character, don't mix them.

A now very obscure problem was that at least one browser, Netscape 6, incorrectly treated id attribute values as case-sensitive. That meant that if you had typed id="firstName" in your HTML (lower-case 'f') and #FirstName { color: red } in your CSS (upper-case 'F'), that buggy browser would have failed to set the element's color to red. At the time of this edit, April 2015, I hope you aren't being asked to support Netscape 6. Consider this a historical footnote.

Best Answer

Related Solutions

Php – How to prevent SQL injection in PHP

Correctly setting up the connection

Explanation

Can prepared statements be used for dynamic queries?

Html – What are valid values for the id attribute in HTML

Related Topic