Reflections of a riverside walker: Teaching Dinosaurs to dance digitally

Part 1 of this article was about why PDF India is not #DigitalIndia and some thoughts on access to public data and data submission.

Now that the rant is over, let me document some of the woes that I came face to face with while accessing public data.

A couple of years back, a friend consulted me about extracting select data fields from hundreds of PDFs on the SEBI website, for research work. It involved extracting fields such as Offer Price, Negotiated Price etc. (well, whatever their finance lingo) from Letter of Offer documents submitted to SEBI by companies involved in mergers and takeovers, over 15 years. I put together a strategy using open-source/free tools, with an eye to automating as many of the tedious tasks as possible while maintaining accuracy, and devised a sequence of steps for automated extraction. Some stages were only semi-automated. Some steps needed a human eye from time to time, to check whether the automation ghost that eerily moved the mouse pointer at midnight was working properly or was being interrupted by the invalid data ghost.

Here are some of the concerns I came across.

The SEBI page of Letters of Offer for Takeovers was treated as the starting point for collecting data fields related to Final Letters of Offer.

Original data collection by SEBI is not structured :

The fact that information was collected as PDF and not as structured data leaves less scope for meaningful analysis of the data. Ideally, the data should have been collected from the companies in more organized formats such as XML, Excel or CSV, or by seeking information in a web form, as is the standard practice. This would have allowed SEBI to collect raw data in a database-friendly format. Since this was not to be, the structured data had to be sought by sifting through PDF documents. For ages, we have been looking at standards like IFRS and XBRL from a distance, but nothing moved because SMEs complained of compliance cost.

No options to bulk download :

There were no options to bulk-download documents by querying for multiple companies based on search criteria. One had to traverse the pages a few links at a time to download the documents. This constraint was later overcome partially by a series of semi-automated steps using tools such as FlashGot, TinyTask and the parsing of the pages for file paths.
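The "parsing of the pages for file paths" step can be sketched in a few lines of Python. This is only an illustration, not the original workflow (which relied on FlashGot and TinyTask): the sample page, the example.org URL and the names PdfLinkCollector and extract_pdf_links are all made up for the sketch.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PdfLinkCollector(HTMLParser):
    """Collect the absolute URL of every <a href> that points at a PDF."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.pdf_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Ignore any query string and compare the extension case-insensitively.
        if href and href.split("?")[0].lower().endswith(".pdf"):
            self.pdf_links.append(urljoin(self.base_url, href))


def extract_pdf_links(page_html, base_url):
    """Return the PDF links found in one listing page, in page order."""
    collector = PdfLinkCollector(base_url)
    collector.feed(page_html)
    return collector.pdf_links


# Hypothetical listing page; the URL and file names are placeholders.
sample_page = """
<ul>
  <li><a href="/docs/12345.pdf">Letter of Offer</a></li>
  <li><a href="/company/page.html">Company page</a></li>
  <li><a href="files/LOF.PDF">LOF</a></li>
</ul>
"""
links = extract_pdf_links(sample_page, "https://example.org/takeovers/index.html")
print(links)
```

The resulting list could then be handed to any download manager, which is roughly the role FlashGot played here.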

Inconsistency and non-standard methods in the organising of PDF links in the SEBI website :

The formats in which the document links were organized varied between the pre-2005 and post-2005 periods. Links to pre-2005 takeover documents would lead directly to a download of the Final Letter of Offer. The file name would be numeric, giving no idea about the company in question, such as this. On the other hand, links to post-2005 takeover documents would lead to an intermediary web page of links for the company (such as this), which in turn would lead to the PDF link. In the later years, the PDF file would be named more meaningfully, such as LOF etc. (which is a relief), but not consistently. This meant more manual downloading and filtering of necessary documents from unnecessary ones. Some filtering of HTML tags using tools such as NotePad++, followed by exporting to a database, was used to partially overcome these constraints.
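The two link styles described above can be told apart mechanically before any downloading starts. Here is a rough Python sketch, assuming that a link ending in .pdf is a direct download and anything else is an intermediary page; the function name classify_link, the category labels and the example URLs are all invented for illustration.

```python
import re
from pathlib import PurePosixPath
from urllib.parse import urlparse


def classify_link(url):
    """Return 'direct_pdf_numeric', 'direct_pdf_named' or 'intermediate_page'."""
    path = PurePosixPath(urlparse(url).path)
    if path.suffix.lower() != ".pdf":
        # Not a PDF: assume it is a per-company page that needs a second hop.
        return "intermediate_page"
    if re.fullmatch(r"\d+", path.stem):
        # Purely numeric file name: says nothing about the company.
        return "direct_pdf_numeric"
    return "direct_pdf_named"


# Placeholder URLs, not real SEBI paths.
examples = [
    "https://example.org/docs/12345.pdf",
    "https://example.org/company/abc.html",
    "https://example.org/files/LOF.pdf",
]
labels = [classify_link(u) for u in examples]
print(labels)
```

Links labelled as numeric would still need the company name looked up from the listing page, since the file name alone does not carry it.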

Unsuitability of PDF format for structured data :

The PDF format doesn't lend itself to efficient parsing to collect data. Moreover, the data fields required (such as Average Price) were often presented in a table inside the document. This meant that the PDF documents needed to go through a series of steps before data could be culled out from them. Calibre software and the online service pdfonline.com were used in batch processing to convert PDFs to HTML web pages. The individual code lines from the web pages were exported to an SQL database and parsed for HTML tags to look for tables that contained the required fields.
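Once a PDF has been converted to HTML, looking for table rows that carry the required fields can be sketched as below. Note that this swaps the SQL-based tag parsing described above for Python's built-in html.parser, purely for illustration; the sample table, the class TableRowCollector and the function find_fields are assumptions of the sketch.

```python
from html.parser import HTMLParser


class TableRowCollector(HTMLParser):
    """Collect each table row as a list of whitespace-normalised cell texts."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside <tr>
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def find_fields(html, keywords):
    """Return (label, value) pairs for rows whose first cell has a keyword."""
    collector = TableRowCollector()
    collector.feed(html)
    wanted = [k.lower() for k in keywords]
    return [
        (row[0], row[1])
        for row in collector.rows
        if len(row) >= 2 and any(k in row[0].lower() for k in wanted)
    ]


# Made-up table standing in for a converted Letter of Offer page.
sample_html = """
<table>
  <tr><th>Particulars</th><th>Price (Rs.)</th></tr>
  <tr><td>Negotiated   Price</td><td>125.00</td></tr>
  <tr><td>Average price</td><td>N.A.</td></tr>
</table>
"""
hits = find_fields(sample_html, ["negotiated price", "average price"])
print(hits)
```

Converted PDFs rarely produce tables this clean, so a sketch like this would only narrow down the candidates for review.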

Lack of uniformity and fixed format for the presence of data in the document :

Inside the PDF document, data was not found consistently in the same location or format. One had to look for references to a section title such as 'Financial Justification' and then look for the data in the table that followed it on the page. Keywords such as 'Negotiated Price', 'Average Price' and various combinations of such phrases had to be listed to look for data. Even after all this, one wasn't sure whether the data fields would indeed be found or not.
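The "find the section title, then look below it" approach can be sketched on plain text as follows. The sample text, the 2000-character window, the regular expression for a number and the function name find_keyword_values are all assumptions made for the sketch, not the steps actually used.

```python
import re


def find_keyword_values(text, section_title, keywords, window=2000):
    """Look for the first number after each keyword, below a section title.

    Returns a dict mapping each keyword to the number found (as a string),
    or to None when the section or the keyword is missing.
    """
    result = {keyword: None for keyword in keywords}
    section = re.search(re.escape(section_title), text, flags=re.IGNORECASE)
    if section is None:
        return result
    # Only search the text that follows the section title.
    body = text[section.end():section.end() + window]
    for keyword in keywords:
        pattern = re.escape(keyword) + r"\D{0,40}?(\d[\d,]*(?:\.\d+)?)"
        match = re.search(pattern, body, flags=re.IGNORECASE)
        if match:
            result[keyword] = match.group(1)
    return result


# Made-up extract; the first mention sits outside the wanted section.
sample_text = """
1. Background
Negotiated Price of Rs. 90 was discussed earlier.
6. Financial Justification
Negotiated Price : Rs. 125.00
Highest price paid : Not Applicable
"""
values = find_keyword_values(
    sample_text, "Financial Justification", ["Negotiated Price", "Average Price"]
)
print(values)
```

A None in the result is exactly the "not sure whether the field would be found" case, and is the cue for a manual look at the document.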

Inefficiency of using text-based search to identify data fields :

Even when found, the data fields might not appear in a consistent form or phrase. For instance, 'Not Applicable' could be any of NA, N.A., N-A, a hyphen etc. Sometimes, the justification section would be a paragraph that contained none of the keywords. For an automated lookup, references to 'infrequently traded' might often be confused with 'infrequently traded on NSE and frequently traded on BSE'. These made automated collection inefficient and necessitated a manual study of the PDF documents.
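Both problems above, the many spellings of 'Not Applicable' and the 'infrequently traded' confusion, can be eased a little with regular expressions. The patterns below are a rough sketch covering only the variants mentioned here; the names NA_PATTERN, is_not_applicable and trading_status are made up for illustration.

```python
import re

# Matches NA, N.A., N-A, n/a, "Not Applicable" or a run of hyphens.
NA_PATTERN = re.compile(
    r"^\s*(n\.?\s*[-/]?\s*a\.?|not\s+applicable|-+)\s*$", re.IGNORECASE
)

# The word boundary stops "frequently" matching inside "infrequently".
TRADE_PATTERN = re.compile(
    r"\b(in)?frequently\s+traded(?:\s+on\s+(?:the\s+)?(NSE|BSE))?",
    re.IGNORECASE,
)


def is_not_applicable(value):
    """True when a cell value is one of the known 'Not Applicable' forms."""
    return bool(NA_PATTERN.match(value))


def trading_status(text):
    """Map each exchange (or 'unspecified') to 'frequent' or 'infrequent'."""
    status = {}
    for match in TRADE_PATTERN.finditer(text):
        exchange = match.group(2).upper() if match.group(2) else "unspecified"
        status[exchange] = "infrequent" if match.group(1) else "frequent"
    return status


mixed = trading_status("infrequently traded on NSE and frequently traded on BSE")
plain = trading_status("The shares are infrequently traded.")
print(mixed, plain)
```

A plain substring search for 'infrequently traded' would treat both sentences alike; keeping the exchange alongside the status is what separates them. Paragraphs that use none of the keywords would still need a manual read.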

Mismatch between different databases in company name mentions :

While some data fields required for the study were to be found in the Letters of Offer, others were extracted from the Prowess database. A takeover involves an acquiring company and a target company. There were inconsistencies and mix-ups in both these names between the Prowess and SEBI data sources. SEBI would list the Letter of Offer against the target company, whereas Prowess would list it the other way round. In some cases, there were multiple takeover instances relating to the same company names, requiring the PDFs of multiple years. This meant resolving the mix-up by manually reviewing the PDFs and data fields. One had to arrive at a cross-tab of SEBI name, Prowess name and the correct Letter-of-Offer name in the takeover context before one could collect accurate data.
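A first pass at such a cross-tab can be automated by normalising the names and fuzzy-matching them, leaving only the unmatched ones for manual review. The sketch below uses Python's difflib; the company names, the suffix list, the 0.85 cutoff and the function names are assumptions for illustration, and it does nothing about the acquirer/target mix-up, which still needs a human to read the PDF.

```python
import difflib
import re

# Common legal suffixes to ignore when comparing names (an assumed list).
SUFFIXES = re.compile(r"\b(limited|ltd|pvt|private|co|company)\b\.?", re.IGNORECASE)


def normalise_name(name):
    """Lower-case a name, drop legal suffixes and punctuation, squeeze spaces."""
    name = SUFFIXES.sub(" ", name.lower())
    name = re.sub(r"[^a-z0-9 ]+", " ", name)
    return " ".join(name.split())


def match_names(sebi_names, prowess_names, cutoff=0.85):
    """Map each SEBI name to its closest Prowess name, or None if none is close."""
    lookup = {normalise_name(p): p for p in prowess_names}
    crosstab = {}
    for sebi_name in sebi_names:
        close = difflib.get_close_matches(
            normalise_name(sebi_name), list(lookup), n=1, cutoff=cutoff
        )
        crosstab[sebi_name] = lookup[close[0]] if close else None
    return crosstab


# Invented company names, standing in for the two data sources.
sebi = ["Alpha Steel Ltd.", "Beta Foods Pvt. Ltd.", "Gamma Textiles Ltd"]
prowess = ["ALPHA STEEL LIMITED", "Beta Foods Private Limited"]
crosstab = match_names(sebi, prowess)
print(crosstab)
```

Names that map to None, and names that map to more than one takeover year, are the ones to send for manual review.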

---------------------------

This is not the issue of a single website of one government institution or a government department. It’s the opposite: the inability to work as a single website for the whole country. It’s about the isolated approach to presenting data to the public or seeking data from them, sulkily treating it as an unwanted duty imposed by the Digital India frenzy.

The data.gov.in portal sets an excellent precedent, with public APIs throwing an ocean of private opportunities open. It is high time that other government institutions joined the bandwagon of a unified data view architecture, both for submission and presentation of data, in truly seamless transactional digitization, keeping in view bulk submission and bulk download options. Data.gov.in itself has plenty of unfinished agenda ahead: unifying the data from the states on a regular timeline, keeping it up to date, and getting the various non-complying departments to co-operate.

Some data dinosaurs have to be taught to dance digitally, because evolution is binary.
You either become distinct or you become extinct.

Saturday, November 18, 2017

Teaching Dinosaurs to dance digitally - Part 2
