Update on Atom feed parsing

By Christian Glahn

I spent yesterday evening fixing the problem with Team Space. While still in the office, I made the Atom parser accept well-formed HTML (which is in fact XHTML, but anyway). Later in the evening I applied a look-ahead regular expression (RegEx) to fix malformed URLs and standalone ampersands. It turned out that this RegEx turned the broken HTML into proper XHTML.

For those who are interested in code and can read RegEx statements, this is the regular expression I used:

   s/&(?!amp;|lt;|gt;|quot;|apos;)/&amp;/g

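This replaces every stray ampersand in the HTML, that is, every ampersand that does not start one of the five predefined XML entities. Note that the infamous &nbsp; entity is not supported: it is not in the exception list, so its ampersand gets escaped as well.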
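For readers who prefer something they can run, here is a minimal Python sketch of the same substitution. The pattern is copied from the expression above; the function name `escape_bare_ampersands` and the sample string are made up for illustration and are not part of the Team Space code.

```python
import re

# Same pattern as the substitution above: an ampersand that is NOT followed
# by one of the five predefined XML entity names (negative look-ahead).
BARE_AMPERSAND = re.compile(r"&(?!amp;|lt;|gt;|quot;|apos;)")


def escape_bare_ampersands(html: str) -> str:
    """Replace every stray '&' with '&amp;'; existing XML entities stay untouched."""
    return BARE_AMPERSAND.sub("&amp;", html)


# Hypothetical sample: a URL with a raw '&', a stray '&' in text,
# an already escaped '&amp;', and an HTML-only entity.
sample = '<a href="view?id=1&page=2">Q&A &amp; more &nbsp;</a>'
print(escape_bare_ampersands(sample))
```

Because the look-ahead only knows the five XML entities, HTML-only entities and numeric references such as `&#160;` are escaped too. If those should survive, the exception list would have to be extended.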

After that I faced the problem that Blogspot's Atom feeds do not provide summaries of the entries, but send the entire content of an entry including all images. The images were causing a bunch of layout problems. Therefore, I decided to remove all images and objects (yeah, no Flash, Java, or ActiveX) from the content before it is displayed on Team Space.
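One way such a clean-up could look is sketched below in Python. This is an assumption about the approach, not the actual Team Space code: the helper name `strip_media` is hypothetical, and the choice to drop `embed` and `applet` along with `img` and `object` is my own reading of "no Flash, Java, or ActiveX".

```python
import re

# Hypothetical helper; the tag list is an assumption, not taken from the post.
# Images are void elements, so one pattern covers both <img ...> and <img ... />.
IMG_TAG = re.compile(r"<img\b[^>]*>", re.IGNORECASE)
# Self-closed plugin elements such as <embed src="..." />.
SELF_CLOSED = re.compile(r"<(?:object|embed|applet)\b[^>]*/>", re.IGNORECASE)
# Paired plugin elements including everything between the tags.
PAIRED = re.compile(
    r"<(object|embed|applet)\b[^>]*>.*?</\1\s*>",
    re.IGNORECASE | re.DOTALL,
)


def strip_media(content: str) -> str:
    """Remove images and embedded objects from an entry's content."""
    content = SELF_CLOSED.sub("", content)  # first, so they are not read as open tags
    content = PAIRED.sub("", content)
    return IMG_TAG.sub("", content)


entry = '<p>A<object data="m.swf"><param name="q" value="1" /></object>B</p>'
print(strip_media(entry))
```

The non-greedy match stops at the first closing tag, so nested `object` elements would leave a fragment behind. For feeds that contain those, walking the parsed XHTML tree would be the safer route.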

Furthermore, I realised that some blog postings showed up only as headlines, with no abstract available. It turned out that this was due to a size limit I had added to an SQL statement. After I removed that limit, all contents are displayed.

project, social, software, web2.0, webapplications
