Saturday, November 18, 2017

Teaching Dinosaurs to dance digitally - Part 2

Part 1 of this article was about why PDF India is not #DigitalIndia and some thoughts on access to public data and data submission. 

Now that the rant is over, let me document some of the woes I came face to face with while accessing public data.

A couple of years back, a friend consulted me on extracting select data fields from hundreds of PDFs on the SEBI website, for research work. It involved extracting fields such as Offer Price, Negotiated Price etc. (well, whatever their finance lingo) from Letter of Offer documents submitted to SEBI by merger/takeover companies over 15 years. Putting together a strategy using open-source/free tools, with an eye to automating the tedious tasks as far as possible while maintaining accuracy, a sequence of steps was devised for automated extraction; some stages were only semi-automated. Some steps required human-eye reviews from time to time, to check whether the automation ghost that eerily moved the mouse pointer at midnight was working properly or was being interrupted by the invalid-data ghost.

Here are some of the concerns I came across. 

The SEBI page of Letters of Offer for Takeovers was treated as the starting point for collecting data fields related to Final Letters of Offer.

Original data collection by SEBI is not structured :

The fact that information was collected as PDF and not as structured data leaves little scope for meaningful analysis. Ideally, the data should have been collected from the companies in more organized formats such as XML, Excel or CSV, or by seeking the information through a web form, as is the standard practice. This would have allowed SEBI to collect the raw data in a database-friendly format. Since this was not to be, it led to a situation where structured data had to be sought by sifting through PDF documents. For ages, we have been looking at standards like IFRS and XBRL from a distance, but nothing moved because SMEs complained of compliance costs.

No options to bulk download :

There were no options to bulk-download documents by querying for multiple companies based on search criteria. One had to traverse the pages a few links at a time to download the documents. This constraint was later partially overcome by a series of semi-automated steps using tools such as FlashGot and TinyTask, and by parsing the pages for file paths.
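For the curious, the link-harvesting part of such a semi-automated pipeline can be sketched in a few lines of Python. Everything here is hypothetical — the page snippet and file paths are made up for illustration; the real SEBI pages needed more coaxing:

```python
from html.parser import HTMLParser

class PdfLinkExtractor(HTMLParser):
    """Collect every href on a page that points to a PDF file."""
    def __init__(self):
        super().__init__()
        self.pdf_links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value and value.lower().endswith(".pdf"):
                    self.pdf_links.append(value)

# A made-up page snippet standing in for a SEBI listing page.
sample_page = """
<html><body>
<a href="/docs/10234.pdf">Letter of Offer</a>
<a href="/docs/about.html">About</a>
<a href="/docs/LOF_XYZ.pdf">LOF</a>
</body></html>
"""

parser = PdfLinkExtractor()
parser.feed(sample_page)
print(parser.pdf_links)  # ['/docs/10234.pdf', '/docs/LOF_XYZ.pdf']
```

The harvested paths can then be handed to any downloader in a loop, which is roughly what the FlashGot/TinyTask combination did through the browser.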

Inconsistency and non-standard methods in the organising of PDF links in the SEBI website :

The formats in which the document links were organized varied between the pre-2005 and post-2005 periods. Links to pre-2005 takeover documents would lead to a direct download link for the Final Letter of Offer. The file name would be numeric, giving no idea about the company in question. Such as this. On the other hand, post-2005 takeover documents would lead to an intermediary web page of links for the company (such as this), which in turn would lead to the PDF link. In the later years, the PDF file would be named more meaningfully, such as LOF etc. (which is a relief), but not consistently. This meant more manual downloading and filtering of necessary documents from unnecessary ones. Some HTML-tag filtering using tools such as Notepad++, followed by exporting to a database, was used to partially overcome these constraints.
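A tiny filename heuristic helps separate the wheat from the chaff in such a situation. This is a sketch, not the actual filter used; the keyword list is an assumption based on the naming patterns described above:

```python
def looks_like_letter_of_offer(path):
    """Heuristic keep-rule: numeric file names (pre-2005 style) or
    names containing LOF-style keywords (post-2005 style)."""
    keywords = ("lof", "letterofoffer", "letter_of_offer")  # assumed patterns
    stem = path.lower().rsplit("/", 1)[-1].rsplit(".", 1)[0]
    return stem.isdigit() or any(k in stem for k in keywords)

print(looks_like_letter_of_offer("10234.pdf"))        # True  (pre-2005 numeric)
print(looks_like_letter_of_offer("LOF_XYZ.pdf"))      # True  (post-2005 named)
print(looks_like_letter_of_offer("corrigendum.pdf"))  # False (unnecessary document)
```

Anything the heuristic rejects still needs a human eye, which is why the filtering remained only semi-automated.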

Unsuitability of PDF format for structured data :

The PDF format doesn't lend itself to efficient parsing for data collection. Moreover, the required data fields (such as Average Price) were often presented in a table inside the document. This meant the PDF documents had to go through a series of steps before data could be culled out of them. Calibre and the online service pdfonline.com were used in batch processing to convert the PDFs to HTML web pages. The individual lines from the web pages were exported to an SQL database and parsed for HTML tags to look for tables containing the required fields.
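Once the PDFs became HTML, pulling candidate table rows was a matter of tag-sniffing. A minimal sketch of that idea in Python — the real pipeline went through an SQL database, and the page fragment here is invented:

```python
import re

def find_field_rows(html, keywords):
    """Return the plain text of table rows that mention any keyword."""
    rows = re.findall(r"<tr.*?</tr>", html, flags=re.S | re.I)
    hits = []
    for row in rows:
        text = re.sub(r"<[^>]+>", " ", row)  # strip the tags
        text = " ".join(text.split())        # squeeze whitespace
        if any(k.lower() in text.lower() for k in keywords):
            hits.append(text)
    return hits

# A made-up fragment of a converted Letter of Offer.
page = ("<table><tr><td>Negotiated Price</td><td>Rs. 45</td></tr>"
        "<tr><td>Some other row</td><td>x</td></tr></table>")
print(find_field_rows(page, ["Negotiated Price", "Average Price"]))
# ['Negotiated Price Rs. 45']
```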

Lack of uniformity and fixed format for the presence of data in the document :

Inside the PDF document, data was not found consistently in the same location or format. One had to look for references to a section title such as 'Financial Justification' and then look for the data in the table that followed below it. Keywords such as 'Negotiated Price', 'Average Price' and various combinations of such phrases had to be listed to look for the data. Even after all this, one couldn't be sure whether the data fields would actually be found.
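The 'look below the section title' trick amounts to restricting the keyword search to a window of text that follows the title. A hedged sketch — the window size is an arbitrary assumption, and the document extract is made up:

```python
def section_window(text, title, size=2000):
    """Return up to `size` characters following the first occurrence
    of `title`, or an empty string if the title is absent."""
    idx = text.lower().find(title.lower())
    return "" if idx < 0 else text[idx: idx + size]

# A made-up document extract.
doc = "... 7. Financial Justification The Negotiated Price is Rs. 45 per share ..."
window = section_window(doc, "Financial Justification")
print("Negotiated Price" in window)                 # True
print(section_window(doc, "Basis of Offer Price"))  # '' (section not found)
```

Keyword lookups were then confined to that window, which cuts down false matches from elsewhere in the document.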

Inefficiency of using text-based search to identify data fields :

When found, the fields may not appear in a consistent form or phrase. For instance, 'Not Applicable' could be any of NA, N.A., N-A, a hyphen etc. Sometimes, the justification section would be a paragraph that contained none of the keywords. For an automated lookup, references to 'infrequently traded' might often be confused with 'infrequently traded on NSE and frequently traded on BSE'. All this made automated collection inefficient and necessitated a manual study of the PDF documents.
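Those variants can at least be tamed with a normalizing pattern. A sketch only — this regex covers just the variants listed above and would have needed constant feeding in practice:

```python
import re

# Matches NA, N.A., N-A, N/A, a lone hyphen, or the phrase itself.
NA_PATTERN = re.compile(r"^\s*(n\.?\s*[.\-/]?\s*a\.?|not\s+applicable|-)\s*$",
                        re.IGNORECASE)

def is_not_applicable(cell):
    """True when a table cell is one of the 'Not Applicable' spellings."""
    return bool(NA_PATTERN.match(cell))

for sample in ["NA", "N.A.", "N-A", "-", "Not Applicable", "45.20"]:
    print(sample, "->", is_not_applicable(sample))
```

The paragraph-without-keywords case, of course, defeats any regex and is where the manual study came in.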

Mismatch between different databases in company name mentions :

While some data fields required for the study were to be found in the Letters of Offer, others were extracted from the Prowess database. Takeovers involve an acquiring company and a target company. There were inconsistencies and mix-ups in both these names between the Prowess and SEBI data sources. SEBI would list the letter of offer against the target company, whereas Prowess would list it the other way round. In some cases, there were multiple takeover instances relating to the same company names, requiring the PDFs of multiple years. This meant resolving the mix-ups by manually reviewing the PDFs and data fields. One had to arrive at a cross-tab of the SEBI name, the Prowess name and the correct Letter-of-Offer name in the takeover context before one could collect accurate data.
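The cross-tab exercise is essentially fuzzy name matching between two lists. A minimal sketch using Python's standard difflib — the normalization rules (like expanding 'Ltd') and the sample names are assumptions, and real reconciliation still needed the manual review mentioned above:

```python
import difflib

def normalize(name):
    """Crude normalization: lowercase, expand 'Ltd', squeeze spaces."""
    name = name.lower().replace("ltd.", "limited").replace("ltd", "limited")
    return " ".join(name.split())

def best_match(name, candidates, cutoff=0.8):
    """Return the candidate whose normalized form best matches, or None."""
    norm_map = {normalize(c): c for c in candidates}
    hits = difflib.get_close_matches(normalize(name), list(norm_map),
                                     n=1, cutoff=cutoff)
    return norm_map[hits[0]] if hits else None

prowess_names = ["ABC Chemicals Limited", "XYZ Textiles Limited"]
print(best_match("ABC Chemicals Ltd.", prowess_names))    # 'ABC Chemicals Limited'
print(best_match("Totally Unrelated Co", prowess_names))  # None
```

Anything that falls below the cutoff lands in the pile for manual review, which is exactly the cross-tab that had to be built by hand.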

---------------------------

This is not the issue of a single website of one government institution or department. It's the opposite: the inability to work as a single website for the whole country. It's about the isolated approach to presenting data to the public or seeking data from them, treating it as an unwanted duty imposed by the Digital India frenzy.

The data.gov.in portal sets an excellent precedent, with public APIs throwing open an ocean of private opportunities. It is high time other government institutions joined the bandwagon of a unified data-view architecture, both for submission and presentation of data, in truly seamless transactional digitization, keeping bulk submission and bulk download options in view. Data.gov.in has plenty of unfinished agenda ahead: unifying the data from the states on a regular timeline, keeping it updated, and getting the various non-complying departments to co-operate.

Some data dinosaurs have to be taught to dance digitally, because evolution is binary. 

You either become distinct or you become extinct. 




Teaching dinosaurs to dance digitally

It ain’t so much of #DigitalIndia yet. It's just PDF India. For those looking for meaningful, processible data, the gap between the two makes a world of difference. For some government institutions, digital means a PDF or JPEG image of scanned print reports, including spreadsheets and balance sheets. One might argue that nowadays it's possible to extract data from PDFs using technology. But that's like garbage disposal and recycling. The mess shouldn't be created in the first place. Image-processing algorithms were not invented for spreadsheets.

A good example of a bad PDF document is the District Census Handbook of 2011. It will have reams and reams of pages, with one page showing the first half of a horizontal spreadsheet, titled “Industrial Category of”, and the next page showing the other half, titled “Marginal Workers”. Man, that must be some innovation in bad design.

Switch to the good example of presentation of the same data. data.gov.in presents hundreds of datasets from the Census of 2011 in open formats for public access. You can access the data for the same district in CSV format, loadable into Excel. Did you know you could download the Indian Railways timetable in Excel? How sweet!

The salient rule in data collection or presentation must be that, at the raw source, the data must be collected in a format that is processible by automation. Human-eye intervention should be minimized to reviews, green/red flags and throwing up exception patterns. Or to discovering insights of wisdom from "rich experience", which a chip can't discern. Days may have come when chips outsmart elders in experiential wisdom as well, LoL.

Companies like HowIndiaLives.com, where my friend Ramnath is involved, address this problem by helping their customers and website visitors make visual sense out of the nonsense data in India's public domain. They also present it in beautiful, user-friendly ways and help project stakeholders glean useful insights from the data. That the data shouldn't have been nonsensical in the first place, in its raw open form, is the sad fact. That they converted rants like these into a business opportunity is their ingenuity. While raw data must be disseminated in open, processible formats, that should lead to an ecosystem of companies like these, which compete in discovering insights from data and presenting them, without having to spend too much time cleaning it. Not just cleaning it; having to fight inconsistencies between multiple sources of data for the same item of information, I guess, must be another tedious task.

Taking a leaf out of international open data like data.un.org, the open data platform for India is a big leap in this direction for sharing of public datasets (though they don’t have options to bulk download data). 

On the other hand, some of the transactional websites of the government can make life extremely tedious. If you are a high-volume transaction submitter, your life can become miserable, having to submit thousands of records into old-style web forms. Some of them must have been in a cave since AJAX was invented. They can put thousands of person names with option buttons on a single page, expecting the user to scroll down or use the browser's Find, choose one name and then submit. Wwwhaaat!

The format in which data must be submitted to government websites must be pre-defined with digital processing in mind. For a billion-headed Titanosaur like India, it should definitely have scale in mind too. Ideally, it shouldn't even be Windows-intensive or Windows-dependent. Kerala, for example, wants to dabble in Linux, and that's a good thing. I hope they don't give up like Munich, the city that wanted to run on Linux. But that kind of thinking is good. It may lead at least to the adoption of open formats for seeking data, if not an open-source OS.

Not all is a sad story with Digital India. There are a handful of bright spots in good design that take scale and digital processibility into account. Aadhaar, no doubt, is a beautiful example. The Income Tax website often has some sudden quirky differences between its Java tool and the Excel tool, with mysterious conclusions about its inability to generate the XML file. But at least it uses XML to upload data. That's a good thing. Even within the Income Tax Department, you may not find the same kind of good design for other tasks, for instance, applying for non-deduction of TDS. Another of my favorite examples of handling technology at scale carefully is SBI's transition to core banking and its merger with associate banks. It was not about open data or about government per se. But, at its scale, it's truly a project of teaching elephants to dance, and for their size, they did a mighty good job of it. The GSTN must be the next Aadhaar-like unifier, after the easing out of the initial troubles.


This is the intermission, the end of Part 1. 

In Part 2, I mention some of the woes I faced while extracting data from PDFs on the SEBI website. It has a sad ending that rounds off by saying:

Some data dinosaurs have to be taught to dance digitally, because evolution is binary. 
You either become distinct or you become extinct.  

Lots of technical debris ahead in Part 2. Blissful poets, musicians and other non-tech readers not allowed beyond this point. :-) :-)

-->> Part 2 


 
THANK YOU: These reflections draw sometimes from readers and friends who initiate ideas, build up discussions, post comments and mention interesting links, some online and some over a cup of coffee or during a riverside walk. Thank you.

Disclaimer: Views expressed in this blog are the blogger's personal opinions and made in his individual capacity, sometimes have a story-type approach, mixing facts with imagination and should not be construed as arising from a professional position or a counselling intention.