Native XML Extensions
I prefer using one of the native XML extensions since they come bundled with PHP, are usually faster than all the 3rd party libs and give me all the control I need over the markup.
The DOM extension allows you to operate on XML documents through the DOM API with PHP 5. It is an implementation of the W3C's Document Object Model Core Level 3, a platform- and language-neutral interface that allows programs and scripts to dynamically access and update the content, structure and style of documents.
DOM is capable of parsing and modifying real world (broken) HTML and it can do XPath queries. It is based on libxml.
It takes some time to get productive with DOM, but that time is well worth it IMO. Since DOM is a language-agnostic interface, you'll find implementations in many languages, so if you need to change your programming language, chances are you will already know how to use that language's DOM API then.
A basic usage example can be found in Grabbing the href attribute of an A element and a general conceptual overview can be found at DOMDocument in php
How to use the DOM extension has been covered extensively on StackOverflow, so if you choose to use it, you can be sure most of the issues you run into can be solved by searching/browsing Stack Overflow.
The XMLReader extension is an XML pull parser. The reader acts as a cursor going forward on the document stream and stopping at each node on the way.
XMLReader, like DOM, is based on libxml. I am not aware of how to trigger the HTML Parser Module, so chances are using XMLReader for parsing broken HTML might be less robust than using DOM where you can explicitly tell it to use libxml's HTML Parser Module.
A basic usage example can be found at getting all values from h1 tags using php
This extension lets you create XML parsers and then define handlers for different XML events. Each XML parser also has a few parameters you can adjust.
The XML Parser library is also based on libxml, and implements a SAX style XML push parser. It may be a better choice for memory management than DOM or SimpleXML, but will be more difficult to work with than the pull parser implemented by XMLReader.
The SimpleXML extension provides a very simple and easily usable toolset to convert XML to an object that can be processed with normal property selectors and array iterators.
SimpleXML is an option when you know the HTML is valid XHTML. If you need to parse broken HTML, don't even consider SimpleXml because it will choke.
A basic usage example can be found at A simple program to CRUD node and node values of xml file and there is lots of additional examples in the PHP Manual.
3rd Party Libraries (libxml based)
If you prefer to use a 3rd-party lib, I'd suggest using a lib that actually uses DOM/libxml underneath instead of string parsing.
FluentDOM provides a jQuery-like fluent XML interface for the DOMDocument in PHP. Selectors are written in XPath or CSS (using a CSS to XPath converter). Current versions extend the DOM implementing standard interfaces and add features from the DOM Living Standard. FluentDOM can load formats like JSON, CSV, JsonML, RabbitFish and others. Can be installed via Composer.
Wa72\HtmlPageDom
is a PHP library for easy manipulation of HTML
documents using DOM. It requires DomCrawler from Symfony2
components for traversing
the DOM tree and extends it by adding methods for manipulating the
DOM tree of HTML documents.
phpQuery (not updated for years)
phpQuery is a server-side, chainable, CSS3 selector driven Document Object Model (DOM) API based on jQuery JavaScript Library written in PHP5 and provides additional Command Line Interface (CLI).
Also see: https://github.com/electrolinux/phpquery
Zend_Dom provides tools for working with DOM documents and structures. Currently, we offer Zend_Dom_Query, which provides a unified interface for querying DOM documents utilizing both XPath and CSS selectors.
QueryPath is a PHP library for manipulating XML and HTML. It is designed to work not only with local files, but also with web services and database resources. It implements much of the jQuery interface (including CSS-style selectors), but it is heavily tuned for server-side use. Can be installed via Composer.
fDOMDocument extends the standard DOM to use exceptions at all occasions of errors instead of PHP warnings or notices. They also add various custom methods and shortcuts for convenience and to simplify the usage of DOM.
sabre/xml is a library that wraps and extends the XMLReader and XMLWriter classes to create a simple "xml to object/array" mapping system and design pattern. Writing and reading XML is single-pass and can therefore be fast and require low memory on large xml files.
FluidXML is a PHP library for manipulating XML with a concise and fluent API.
It leverages XPath and the fluent programming pattern to be fun and effective.
3rd-Party (not libxml-based)
The benefit of building upon DOM/libxml is that you get good performance out of the box because you are based on a native extension. However, not all 3rd-party libs go down this route. Some of them listed below
- An HTML DOM parser written in PHP5+ lets you manipulate HTML in a very easy way!
- Require PHP 5+.
- Supports invalid HTML.
- Find tags on an HTML page with selectors just like jQuery.
- Extract contents from HTML in a single line.
I generally do not recommend this parser. The codebase is horrible and the parser itself is rather slow and memory hungry. Not all jQuery Selectors (such as child selectors) are possible. Any of the libxml based libraries should outperform this easily.
PHPHtmlParser is a simple, flexible, html parser which allows you to select tags using any css selector, like jQuery. The goal is to assiste in the development of tools which require a quick, easy way to scrape html, whether it's valid or not! This project was original supported by sunra/php-simple-html-dom-parser but the support seems to have stopped so this project is my adaptation of his previous work.
Again, I would not recommend this parser. It is rather slow with high CPU usage. There is also no function to clear memory of created DOM objects. These problems scale particularly with nested loops. The documentation itself is inaccurate and misspelled, with no responses to fixes since 14 Apr 16.
- A universal tokenizer and HTML/XML/RSS DOM Parser
-
Ability to manipulate elements and their attributes
-
Supports invalid HTML and UTF8
-
Can perform advanced CSS3-like queries on elements (like jQuery -- namespaces supported)
- A HTML beautifier (like HTML Tidy)
-
Minify CSS and Javascript
-
Sort attributes, change character case, correct indentation, etc.
- Extensible
-
Parsing documents using callbacks based on current character/token
-
Operations separated in smaller functions for easy overriding
- Fast and Easy
Never used it. Can't tell if it's any good.
HTML 5
You can use the above for parsing HTML5, but there can be quirks due to the markup HTML5 allows. So for HTML5 you want to consider using a dedicated parser, like
html5lib
A Python and PHP implementations of a HTML parser based on the WHATWG HTML5 specification for maximum compatibility with major desktop web browsers.
We might see more dedicated parsers once HTML5 is finalized. There is also a blogpost by the W3's titled How-To for html 5 parsing that is worth checking out.
WebServices
If you don't feel like programming PHP, you can also use Web services. In general, I found very little utility for these, but that's just me and my use cases.
ScraperWiki's external interface allows you to extract data in the form you want for use on the web or in your own applications. You can also extract information about the state of any scraper.
Regular Expressions
Last and least recommended, you can extract data from HTML with regular expressions. In general using Regular Expressions on HTML is discouraged.
Most of the snippets you will find on the web to match markup are brittle. In most cases they are only working for a very particular piece of HTML. Tiny markup changes, like adding whitespace somewhere, or adding, or changing attributes in a tag, can make the RegEx fails when it's not properly written. You should know what you are doing before using RegEx on HTML.
HTML parsers already know the syntactical rules of HTML. Regular expressions have to be taught for each new RegEx you write. RegEx are fine in some cases, but it really depends on your use-case.
You can write more reliable parsers, but writing a complete and reliable custom parser with regular expressions is a waste of time when the aforementioned libraries already exist and do a much better job on this.
Also see Parsing Html The Cthulhu Way
Books
If you want to spend some money, have a look at
I am not affiliated with PHP Architect or the authors.
What are the syntax errors?
PHP belongs to the C-style and imperative programming languages. It has rigid grammar rules, which it cannot recover from when encountering misplaced symbols or identifiers. It can't guess your coding intentions.
Most important tips
There are a few basic precautions you can always take:
Use proper code indentation, or adopt any lofty coding style.
Readability prevents irregularities.
Use an IDE or editor for PHP with syntax highlighting.
Which also help with parentheses/bracket balancing.
Read the language reference and examples in the manual.
Twice, to become somewhat proficient.
How to interpret parser errors
A typical syntax error message reads:
Parse error: syntax error, unexpected T_STRING, expecting ';
' in file.php on line 217
Which lists the possible location of a syntax mistake. See the mentioned file name and line number.
A moniker such as T_STRING
explains which symbol the parser/tokenizer couldn't process finally. This isn't necessarily the cause of the syntax mistake, however.
It's important to look into previous code lines as well. Often syntax errors are just mishaps that happened earlier. The error line number is just where the parser conclusively gave up to process it all.
Solving syntax errors
There are many approaches to narrow down and fix syntax hiccups.
Open the mentioned source file. Look at the mentioned code line.
For runaway strings and misplaced operators, this is usually where you find the culprit.
Read the line left to right and imagine what each symbol does.
More regularly you need to look at preceding lines as well.
In particular, missing ;
semicolons are missing at the previous line ends/statement. (At least from the stylistic viewpoint. )
If {
code blocks }
are incorrectly closed or nested, you may need to investigate even further up the source code. Use proper code indentation to simplify that.
Look at the syntax colorization!
Strings and variables and constants should all have different colors.
Operators +-*/.
should be tinted distinct as well. Else they might be in the wrong context.
If you see string colorization extend too far or too short, then you have found an unescaped or missing closing "
or '
string marker.
Having two same-colored punctuation characters next to each other can also mean trouble. Usually, operators are lone if it's not ++
, --
, or parentheses following an operator. Two strings/identifiers directly following each other are incorrect in most contexts.
Whitespace is your friend.
Follow any coding style.
Break up long lines temporarily.
You can freely add newlines between operators or constants and strings. The parser will then concretize the line number for parsing errors. Instead of looking at the very lengthy code, you can isolate the missing or misplaced syntax symbol.
Split up complex if
statements into distinct or nested if
conditions.
Instead of lengthy math formulas or logic chains, use temporary variables to simplify the code. (More readable = fewer errors.)
Add newlines between:
- The code you can easily identify as correct,
- The parts you're unsure about,
- And the lines which the parser complains about.
Partitioning up long code blocks really helps to locate the origin of syntax errors.
Comment out offending code.
If you can't isolate the problem source, start to comment out (and thus temporarily remove) blocks of code.
As soon as you got rid of the parsing error, you have found the problem source. Look more closely there.
Sometimes you want to temporarily remove complete function/method blocks. (In case of unmatched curly braces and wrongly indented code.)
When you can't resolve the syntax issue, try to rewrite the commented out sections from scratch.
As a newcomer, avoid some of the confusing syntax constructs.
The ternary ? :
condition operator can compact code and is useful indeed. But it doesn't aid readability in all cases. Prefer plain if
statements while unversed.
PHP's alternative syntax (if:
/elseif:
/endif;
) is common for templates, but arguably less easy to follow than normal {
code }
blocks.
The most prevalent newcomer mistakes are:
Missing semicolons ;
for terminating statements/lines.
Mismatched string quotes for "
or '
and unescaped quotes within.
Forgotten operators, in particular for the string .
concatenation.
Unbalanced (
parentheses )
. Count them in the reported line. Are there an equal number of them?
Don't forget that solving one syntax problem can uncover the next.
If you make one issue go away, but other crops up in some code below, you're mostly on the right path.
If after editing a new syntax error crops up in the same line, then your attempted change was possibly a failure. (Not always though.)
Restore a backup of previously working code, if you can't fix it.
- Adopt a source code versioning system. You can always view a
diff
of the broken and last working version. Which might be enlightening as to what the syntax problem is.
Invisible stray Unicode characters: In some cases, you need to use a hexeditor or different editor/viewer on your source. Some problems cannot be found just from looking at your code.
Try grep --color -P -n "\[\x80-\xFF\]" file.php
as the first measure to find non-ASCII symbols.
In particular BOMs, zero-width spaces, or non-breaking spaces, and smart quotes regularly can find their way into the source code.
Take care of which type of linebreaks are saved in files.
PHP just honors \n newlines, not \r carriage returns.
Which is occasionally an issue for MacOS users (even on OS X for misconfigured editors).
It often only surfaces as an issue when single-line //
or #
comments are used. Multiline /*...*/
comments do seldom disturb the parser when linebreaks get ignored.
If your syntax error does not transmit over the web:
It happens that you have a syntax error on your machine. But posting the very same file online does not exhibit it anymore. Which can only mean one of two things:
Check your PHP version. Not all syntax constructs are available on every server.
Those aren't necessarily the same. In particular when working with frameworks, you will them to match up.
Don't use PHP's reserved keywords as identifiers for functions/methods, classes or constants.
Trial-and-error is your last resort.
If all else fails, you can always google your error message. Syntax symbols aren't as easy to search for (Stack Overflow itself is indexed by SymbolHound though). Therefore it may take looking through a few more pages before you find something relevant.
Further guides:
White screen of death
If your website is just blank, then typically a syntax error is the cause.
Enable their display with:
error_reporting = E_ALL
display_errors = 1
In your php.ini
generally, or via .htaccess
for mod_php,
or even .user.ini
with FastCGI setups.
Enabling it within the broken script is too late because PHP can't even interpret/run the first line. A quick workaround is crafting a wrapper script, say test.php
:
<?php
error_reporting(E_ALL);
ini_set("display_errors", 1);
include("./broken-script.php");
Then invoke the failing code by accessing this wrapper script.
It also helps to enable PHP's error_log
and look into your webserver's error.log
when a script crashes with HTTP 500 responses.
Best Answer
Difficult to go off of without an exact error message, but these (sql1, sql2, etc) should be variables (flagged with $):