Best Way to Parse a File in PHP – Web Development Guide

Tags: design, parsing, PHP, web-development

I'm trying to find a better way to build a parser for some well-known file formats, such as EDIFACT and TRADACOMS.

If you aren't familiar with these standards then check out this example from Wikipedia:

See below for an example of an EDIFACT message used to answer a product availability request:

UNA:+.? '
UNB+IATB:1+6XPPC+LHPPC+940101:0950+1'
UNH+1+PAORES:93:1:IA'
MSG+1:45'
IFT+3+XYZCOMPANY AVAILABILITY'
ERC+A7V:1:AMD'
IFT+3+NO MORE FLIGHTS'
ODI'
TVL+240493:1000::1220+FRA+JFK+DL+400+C'
PDI++C:3+Y::3+F::1'
APD+714C:0:::6++++++6X'
TVL+240493:1740::2030+JFK+MIA+DL+081+C'
PDI++C:4'
APD+EM2:0:130::6+++++++DA'
UNT+13+1'
UNZ+1+1'

The UNA segment is optional. If present, it specifies the special characters that are to be used to interpret the remainder of the message. There are six characters following UNA in this order:

  • component data element separator (: in this sample)
  • data element separator (+ in this sample)
  • decimal mark (. in this sample)
  • release character (? in this sample)
  • reserved, must be a space
  • segment terminator (' in this sample)
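In PHP, reading those six characters could look roughly like the sketch below. `parseUna` and the array keys are hypothetical names chosen for illustration, and falling back to the sample's characters when UNA is absent is an assumption, not something the quoted text specifies.

```php
<?php
// Hypothetical helper: read the six service characters from an optional UNA
// segment. When UNA is absent, fall back to the characters used in the
// sample above (an assumption made for this sketch).
function parseUna(string $message): array
{
    $defaults = [
        'component' => ':',
        'element'   => '+',
        'decimal'   => '.',
        'release'   => '?',
        'reserved'  => ' ',
        'segment'   => "'",
    ];

    // UNA is the three-letter tag followed by exactly six characters.
    if (strncmp($message, 'UNA', 3) !== 0 || strlen($message) < 9) {
        return $defaults;
    }

    // Map the six characters onto the names above, in the listed order.
    return array_combine(array_keys($defaults), str_split(substr($message, 3, 6)));
}

$delims = parseUna("UNA:+.? 'UNB+IATB:1+6XPPC+LHPPC+940101:0950+1'");
```

The rest of the parser can then take its separators from `$delims` instead of hard-coding them, which helps when a supplier declares unusual characters.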

As you can see, it's just data formatted in a special way, waiting to be parsed (much like XML files).

My system is built on PHP, and I was able to create a parser using regular expressions for each segment. The problem is that not everybody implements the standard perfectly.

Some suppliers ignore optional segments and fields entirely, while others send more data than the rest. That's why I was forced to create validators for segments and fields to test whether a file is correct.

You can imagine the nightmare of regular expressions I'm having right now. In addition, each supplier needs so many modifications to the regular expressions that I end up building a separate parser for each supplier.


Questions:

1- Is this the best practice for parsing files (using regular expressions)?

2- Is there a better solution for parsing files (maybe there are ready-made solutions out there)? Will it be able to show which segment is missing or whether the file is corrupted?

3- If I have to build my own parser anyway, what design pattern or methodology should I use?

Notes:

I read somewhere about yacc and ANTLR, but I don't know if they match my needs or not!

Best Answer

What you need is a true parser. Regular expressions handle lexing, not parsing: they identify tokens within your input stream. Parsing deals with the context of those tokens, i.e. which token goes where and in what order.
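To make the lexing/parsing split concrete, here is a minimal hand-written sketch in plain PHP. It is not the flex/bison approach, just an illustration of the two stages. `tokenize`, `checkOrder` and the rule list are hypothetical, the delimiters are hard-coded to the sample's values, any UNA segment is assumed to have been stripped already, and the grammar is a simplified made-up one rather than the real message definition.

```php
<?php
// Lexing: split raw text into segments -> elements -> components, honouring
// the release (escape) character. Delimiters are hard-coded to the sample's.
function tokenize(string $message): array
{
    $segments = [];
    $segment  = [[]];   // list of elements, each a list of components
    $buffer   = '';
    $length   = strlen($message);

    for ($i = 0; $i < $length; $i++) {
        $char = $message[$i];
        if ($char === '?' && $i + 1 < $length) {
            $buffer .= $message[++$i];               // escaped char is literal
        } elseif ($char === ':') {
            $segment[count($segment) - 1][] = $buffer;   // end of component
            $buffer = '';
        } elseif ($char === '+') {
            $segment[count($segment) - 1][] = $buffer;   // end of element
            $segment[] = [];
            $buffer = '';
        } elseif ($char === "'") {
            $segment[count($segment) - 1][] = $buffer;   // end of segment
            $segments[] = $segment;
            $segment = [[]];
            $buffer = '';
        } elseif ($char !== "\n" && $char !== "\r") {
            $buffer .= $char;
        }
    }
    return $segments;
}

// Parsing: check that segment tags appear in the order the grammar allows.
// Each rule is [tag, required, repeatable].
function checkOrder(array $segments, array $rules): array
{
    $errors = [];
    $pos    = 0;
    $count  = count($segments);

    foreach ($rules as [$tag, $required, $repeatable]) {
        $seen = 0;
        while ($pos < $count && $segments[$pos][0][0] === $tag) {
            $pos++;
            $seen++;
            if (!$repeatable) {
                break;
            }
        }
        if ($seen === 0 && $required) {
            $errors[] = "missing required segment $tag";
        }
    }
    if ($pos < $count) {
        $errors[] = 'unexpected segment ' . $segments[$pos][0][0] . " at position $pos";
    }
    return $errors;
}

// Simplified, hypothetical grammar: IFT is optional and may repeat.
$rules = [
    ['UNB', true, false], ['UNH', true, false], ['MSG', true, false],
    ['IFT', false, true], ['UNT', true, false], ['UNZ', true, false],
];

$good = "UNB+IATB:1+6XPPC+LHPPC+940101:0950+1'UNH+1+PAORES:93:1:IA'MSG+1:45'"
      . "IFT+3+XYZCOMPANY AVAILABILITY'UNT+4+1'UNZ+1+1'";
$bad  = "UNB+IATB:1+6XPPC+LHPPC+940101:0950+1'UNH+1+PAORES:93:1:IA'"
      . "IFT+3+NO MORE FLIGHTS'UNT+3+1'UNZ+1+1'";

$goodErrors = checkOrder(tokenize($good), $rules);
$badErrors  = checkOrder(tokenize($bad), $rules);
```

With a rule table like this, supplier quirks become data (a different `$rules` array per supplier) rather than a new set of regular expressions, and the error list says which segment is missing. Nested segment groups need a richer grammar than this flat list, which is where a parser generator starts to pay off.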

The classic parsing tool is yacc/bison, and the classic lexer is lex/flex. Since PHP can integrate C code (for example through an extension), you can build your parser with flex and bison, have PHP call it on the input file or stream, and then get your results back.

It will be blazing fast, and far easier to work with once you understand the tools. I suggest reading lex & yacc, 2nd Edition, from O'Reilly. For an example, I've set up a flex and bison project on GitHub, with a makefile. It can be cross-compiled for Windows if necessary.

It is complex, but as you found out, what you need done is complex. There is a great deal of "stuff" that must be done for a properly working parser, and flex and bison deal with the mechanical bits. Otherwise, you find yourself in the unenviable position of writing code at the same abstraction layer as assembly.