Saturday, November 18, 2017

Teaching Dinosaurs to dance digitally - Part 2

Part 1 of this article was about why PDF India is not #DigitalIndia and some thoughts on access to public data and data submission. 

Now that the rant is over, let me document some of the woes that I came face to face with while accessing public data.

A couple of years back, a friend consulted me about extracting select data fields from hundreds of PDFs on the SEBI website, for research work. It involved extracting fields such as Offer Price, Negotiated Price etc. (well, whatever their finance lingo) from Letter of Offer documents submitted to SEBI by merger/takeover companies over 15 years. I put together a strategy using open-source/free tools, with an eye to automating the tedious tasks as far as possible while maintaining accuracy, and devised a sequence of steps for automated extraction. Some stages were only semi-automated. Some steps needed a human eye from time to time, to check whether the automation ghost that eerily moved the mouse pointer at midnight was working properly or was being interrupted by the invalid data ghost.

Here are some of the concerns I came across. 

The SEBI page of Letters of Offer for Takeovers was treated as the starting point for collecting the data fields related to Final Letters of Offer.

Original data collection by SEBI is not structured :

The fact that information was collected as PDF and not as structured data gives less scope for meaningful analysis of the data. Ideally, the data should have been collected from the companies in more organized formats such as XML, Excel or CSV, or by seeking information in a web form, as is the standard practice. This would have allowed SEBI to collect the raw data in a database-friendly format. Since this was not to be, the structured data had to be sought by sifting through PDF documents. For ages, we have been looking at standards like IFRS and XBRL from a distance, but nothing moved because SMEs complained of compliance cost.

No options to bulk download :

There were no options to bulk-download documents by querying for multiple companies based on search criteria. One had to traverse the pages a few links at a time to download the documents. This constraint was later partially overcome by a series of semi-automated steps using tools such as FlashGot and TinyTask, and by parsing the pages for file paths.
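The "parsing of the pages for file paths" step can be sketched in Python roughly as follows. This is only an illustration under assumptions, not the tooling used at the time: the sample HTML, the example.org base URL and the names PdfLinkCollector and extract_pdf_links are all made up for the sketch.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PdfLinkCollector(HTMLParser):
    """Collect the href of every <a> tag that points at a PDF file."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.pdf_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Ignore any query string when checking the file extension.
        if href and href.lower().split("?")[0].endswith(".pdf"):
            self.pdf_links.append(urljoin(self.base_url, href))


def extract_pdf_links(page_html, base_url):
    """Return absolute URLs of all PDF links found in page_html."""
    collector = PdfLinkCollector(base_url)
    collector.feed(page_html)
    return collector.pdf_links


# Hypothetical listing-page fragment, for illustration only.
sample_page = """
<ul>
  <li><a href="/takeover/12345.pdf">Letter of Offer</a></li>
  <li><a href="/takeover/company.html">Company page</a></li>
  <li><a href="docs/LOF_2007.PDF">LOF</a></li>
</ul>
"""
links = extract_pdf_links(sample_page, "https://example.org/listing/")
```

The resulting list of URLs could then be handed to any download manager, which keeps the slow link-by-link clicking out of the loop.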

Inconsistency and non-standard methods in the organising of PDF links in the SEBI website :

The formats in which the document links were organized varied between the pre-2005 and post-2005 periods. Links to pre-2005 takeover documents would lead to a direct download link of the Final Letter of Offer. The file name would be numeric, giving no idea about the company in question. Such as this. On the other hand, links to post-2005 takeover documents would lead to an intermediary web page of links for the company (such as this), which in turn would lead to the PDF link. In the later years, the PDF file would be named more meaningfully, such as LOF etc. (which is a relief), but not consistently. This meant more manual downloading and filtering of necessary documents from unnecessary ones. Some filtering of HTML tags using tools such as Notepad++, followed by exporting to a database, was used to partially overcome these constraints.

Unsuitability of PDF format for structured data :

The PDF format doesn't lend itself to efficient parsing to collect data. Moreover, the data fields required (such as Average price), were often presented in a table inside the document. This meant that the PDF documents needed to go through a series of steps before data could be culled out from them. Calibre software and the online service pdfonline.com were used in batch processing to convert PDFs to HTML web pages. The individual code lines from the web pages were exported to an SQL database and parsed for HTML tags to look for tables that contained the required fields.

Lack of uniformity and fixed format for the presence of data in the document :

Inside the PDF document, data was not found consistently in the same location or format. One had to look for references to the section title such as 'Financial Justification' and then look for the data in the table that geographically followed it below. Keywords such as 'Negotiated Price', 'Average Price' and various combinations of such phrases had to be listed to look for data. Even after all this, one wasn't sure whether the data fields would indeed be found or not. 
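To give a feel for the table lookup described above, here is a minimal Python sketch that walks the table rows of a converted HTML page and picks up values for the listed keywords. It is not the SQL-based process that was actually used; the sample HTML, the class TableRowCollector and the function find_fields are assumptions made for illustration.

```python
from html.parser import HTMLParser

# Keywords taken from the text above; real documents needed many more variants.
KEYWORDS = ("negotiated price", "average price")


class TableRowCollector(HTMLParser):
    """Collect every table row as a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse runs of whitespace left over from the PDF conversion.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def find_fields(page_html, keywords=KEYWORDS):
    """Map each keyword to the last cell of the first row whose label contains it."""
    collector = TableRowCollector()
    collector.feed(page_html)
    found = {}
    for row in collector.rows:
        if len(row) < 2:
            continue
        label = row[0].lower()
        for keyword in keywords:
            if keyword in label and keyword not in found:
                found[keyword] = row[-1]
    return found


# Hypothetical fragment of a converted Letter of Offer, for illustration only.
sample_html = """
<h3>Financial Justification</h3>
<table>
  <tr><th>Particulars</th><th>Price (Rs.)</th></tr>
  <tr><td>Negotiated Price</td><td>125.00</td></tr>
  <tr><td>Average  price of last 26 weeks</td><td>N.A.</td></tr>
</table>
"""
fields = find_fields(sample_html)
```

A keyword missing from the returned dictionary is the signal that the document needs a manual look.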

Inefficiency of using text-based search to identify data fields :

Even when found, the values might not appear in a consistent form or phrase. For instance, 'Not Applicable' could be any of NA, N.A., N-A, a hyphen etc. Sometimes, the justification section would be a paragraph that contained none of the keywords. In an automated lookup, references to 'infrequently traded' might often be confused with 'infrequently traded on NSE and frequently traded on BSE'. These made automated collection inefficient and necessitated a manual study of the PDF documents.
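Both problems can be narrowed down, though not eliminated, with a few regular expressions. The Python sketch below is only an assumption of how one might go about it: the patterns cover just the variants mentioned above, and the function names normalise_value and trading_status are hypothetical.

```python
import re

# Spellings of 'Not Applicable' seen in the text: NA, N.A., N-A, a lone hyphen.
NOT_APPLICABLE = re.compile(
    r"^\s*(n\s*[./-]?\s*a\s*\.?|not\s+applicable|-+)\s*$", re.IGNORECASE
)

# Capture the optional 'in' prefix together with the exchange it refers to.
EXCHANGE_STATUS = re.compile(
    r"\b(in)?frequently\s+traded\s+on\s+(?:the\s+)?(NSE|BSE)\b", re.IGNORECASE
)


def normalise_value(raw):
    """Return None for any 'Not Applicable' spelling, else the stripped text."""
    if NOT_APPLICABLE.match(raw):
        return None
    return raw.strip()


def trading_status(text):
    """Map each exchange named in the text to 'frequent' or 'infrequent'."""
    status = {}
    for match in EXCHANGE_STATUS.finditer(text):
        prefix, exchange = match.group(1), match.group(2).upper()
        status[exchange] = "infrequent" if prefix else "frequent"
    return status


values = [normalise_value(v) for v in ["NA", "N.A.", "N-A", "-", " 125.00 "]]
status = trading_status("infrequently traded on NSE and frequently traded on BSE")
```

Tying the frequency word to the exchange name is what keeps the NSE and BSE cases apart. A paragraph that uses none of these phrases still falls through to manual review.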

Mismatch between different databases in company name mentions :

While some data fields required for the study were to be found in the Letters of Offer, others were extracted from the Prowess database. A takeover involves an acquiring company and a target company. There were inconsistencies and mix-ups in both these names between the Prowess and SEBI data sources. SEBI would list the letter of offer against the target company whereas Prowess would list it the other way round. In some cases, there were multiple takeover instances relating to the same company names, requiring the PDFs of multiple years. This meant resolving the mix-up by manually reviewing the PDF and data fields. One had to arrive at a cross-tab of SEBI name, Prowess name and the correct Letter-of-Offer name in the takeover context before one could collect accurate data.
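A rough Python sketch of such a cross-tab is given below. It assumes, purely for illustration, that each source can be exported as rows with two company names and a year; the field names (target, acquirer, listed_under, other_party, year), the sample companies and the functions name_key and cross_tab are all hypothetical and do not reflect the actual SEBI or Prowess layouts.

```python
import re

# Common legal suffixes that differ between sources and carry no identity.
SUFFIXES = re.compile(r"\b(limited|ltd|pvt|private|co|company)\b\.?", re.IGNORECASE)


def name_key(name):
    """Reduce a company name to a lower-case key without suffixes or punctuation."""
    name = SUFFIXES.sub(" ", name)
    name = re.sub(r"[^a-z0-9]+", " ", name.lower())
    return " ".join(name.split())


def cross_tab(sebi_rows, prowess_rows):
    """Pair records whose two company names and year agree, in either order."""
    index = {}
    for row in prowess_rows:
        pair = frozenset((name_key(row["listed_under"]), name_key(row["other_party"])))
        index.setdefault((pair, row["year"]), []).append(row)

    matches, unmatched = [], []
    for row in sebi_rows:
        pair = frozenset((name_key(row["target"]), name_key(row["acquirer"])))
        hits = index.get((pair, row["year"]), [])
        if len(hits) == 1:
            matches.append((row, hits[0]))
        else:
            # No candidate, or several: leave it for manual review.
            unmatched.append(row)
    return matches, unmatched


# Made-up records, for illustration only.
sebi = [
    {"target": "ABC Industries Limited", "acquirer": "XYZ Holdings Pvt. Ltd.", "year": 2006},
    {"target": "PQR Mills Ltd", "acquirer": "LMN Capital", "year": 2008},
]
prowess = [
    {"listed_under": "XYZ HOLDINGS PVT LTD", "other_party": "ABC INDUSTRIES LTD.", "year": 2006},
]
matches, unmatched = cross_tab(sebi, prowess)
```

Using an unordered pair of names sidesteps the target-versus-acquirer flip, and the year separates repeat takeovers of the same company. Anything ambiguous is deliberately left unmatched rather than guessed.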

---------------------------

This is not the issue of a single website of one government institution or department. It’s the opposite: the inability to work as a single website for the whole country. It’s about the isolated approach to presenting data to the public or seeking data from them, sulking over it as an unwanted duty imposed by the Digital India frenzy.

The data.gov.in portal sets an excellent precedent, with public APIs throwing open an ocean of private opportunities. It is high time that other government institutions joined the bandwagon of a unified data view architecture, both for submission and presentation of data, in truly seamless transactional digitization, keeping in view bulk submission and bulk download options. Data.gov.in still has plenty of unfinished agenda ahead: unifying the data from the states on a regular timeline, keeping it updated, and getting the various non-complying departments to co-operate.

Some data dinosaurs have to be taught to dance digitally, because evolution is binary. 

You either become distinct or you become extinct. 




Teaching dinosaurs to dance digitally

It ain’t so much of #DigitalIndia yet. It's just PDF India. For those looking for meaningful, processible data, the difference between the two can make a world of difference. For some government institutions, digital means a PDF or JPEG image of scanned print reports, including spreadsheets and balance sheets. One might argue that nowadays it's possible to extract data from PDFs using technology. But it's like garbage disposal and recycling. The mess shouldn't be created in the first place. Image processing algorithms were not invented for spreadsheets.

A good example of a bad PDF document is the District Census Handbook of 2011. It has reams and reams of pages, with one page showing the first half of a horizontal spreadsheet, titled “Industrial Category of”, and the next page showing the other half, titled “Marginal Workers”. Man, that must be some innovation in bad design.

Switch to the good example of presentation of the same data. data.gov.in presents hundreds of datasets from the Census of 2011 in open formats for public access. You can access the data for the same district in CSV format, loadable into Excel. Did you know you could download the Indian Railways timetable in Excel? How sweet!

The salient rule in data collection or presentation must be that, at the raw source, the data is collected in a format that is processible by automation. Human-eye intervention should be kept to a minimum: for reviews, green/red flags and spotting exception patterns. Or for discovering insights of wisdom from "rich experience", which a chip can't discern. Days may come when chips outsmart elders in experiential wisdom as well, LoL.

Companies like HowIndiaLives.com, where my friend Ramnath is involved, address this problem by helping their customers and website visitors make visual sense out of the nonsense data in India's public domain. They also present it in beautiful and user-friendly ways and help project stakeholders glean useful insights from data. That the data shouldn't have been nonsensical in the first place, in its raw open form, is the sad fact. That they converted rants like these into a business opportunity is their ingenuity. Raw data must be disseminated in open, processible formats, and that should lead to an ecosystem of companies like these, which compete in discovering insights from data and presenting them, without having to spend too much time cleaning it. Beyond the cleaning, having to fight inconsistencies between multiple sources of data for the same item of information must, I guess, be another tedious task.

Taking a leaf out of international open data like data.un.org, the open data platform for India is a big leap in this direction for sharing of public datasets (though they don’t have options to bulk download data). 

On the other hand, some of the government's transactional websites can make life extremely tedious. If you are a high-volume transaction submitter, your life can become miserable, having to submit thousands of records into old-style web forms. Some of them must have been in a cave since AJAX was invented. They can put thousands of person names with option buttons on a single page, expecting the user to scroll down or use the browser's Find, choose one name and then submit. Wwwhaaat!

The format in which data must be submitted to government websites must be pre-defined with digital processing in mind. For a billion-headed Titanosaur like India, it should definitely have scale in mind too. Ideally, it shouldn't even be tied to or require Windows OS. Kerala, for example, wants to dabble in Linux and it's a good thing. I hope they don't give up like Munich, the city that wanted to run on Linux. But that kind of thinking is good. It may lead at least to the adoption of open formats for seeking data, if not an open-source OS.

Not all is a sad story with Digital India. There are a handful of bright spots of good design that take scale and digital processibility into account. Aadhaar, no doubt, is a beautiful example. The Income Tax website often has some sudden quirky differences between its Java tool and the Excel tool, with mysterious conclusions of inability to generate the XML file. But it at least uses XML to upload data. That's a good thing. Even within the Income Tax Department, you may not find the same kind of good design for other tasks, for instance, for applying for non-deduction of TDS. Another of my favorite examples of handling technology at scale carefully is SBI's transition to core banking and their merger with associate banks. It was not about open data or about government per se. But, at its scale, it's truly a project of teaching elephants to dance, and for their size, they did a mighty good job at it. The GSTN must be the next Aadhaar-like unifier, after the easing out of the initial troubles.


This is the intermission, the end of Part 1. 

In Part 2, I mention some of the woes I faced while extracting data from PDFs from the SEBI website. It has a sad ending that rounds up by saying :

Some data dinosaurs have to be taught to dance digitally, because evolution is binary. 
You either become distinct or you become extinct.  

Lot of technical debris ahead on Part 2. Blissful poets, musicians and other non-tech readers not allowed beyond this point. :-) :-)

-->> Part 2 


Saturday, October 21, 2017

Book Reflections : The Shattering of the Soul

How long does a good person remain good ? The true test of the goodness of a person may lie in extreme conditions that test it. They say the strongest of tyres are tested on the toughest of roads. Is your neighbour good, will he remain good to you in times of trouble ? Wait a minute, ask the Muslim widows who are victims of the Bosnian war. We don't get the enthusiastic answer that we would normally expect. Most of us live under "normal" conditions, so it helps to assume an average goodness in the people around us. It's necessary too. However, in times of war, or war-like riots, the same human acts differently, as if possessed by a war ghost.
 
The book, "The Shattering of the Soul", which I just completed, captures this aspect of human nature. It captures the stories of war misery of 10 Bosnian Muslim women, in first-hand accounts of how the ethnic cleansing by Bosnian Serbs during 1992-95 unfolded, and how their lives were changed overnight. All the accounts have plenty of events in common, which makes the book a little repetitive in detail. They would all say, roughly, “We had a house and a farm, and we grew our food. War broke out, we were invaded and looted. We fled. We want to go back to our roots, but what is left but ruins?”.
 
But, if you read one story at a time, during train travels as I did, you see the common thread not just of the events, but of both evil and good in man. You see that all grief is similar to the onlooker, yet each grief is different for the victim. The feelings of the common people in a typical village are so different from those of the ones who might have initiated the wars, but the stories of war travel far and wide to create more wars and more misery. It is as if the ethnic war ghost is a virus that spreads like an epidemic. It spreads, not through touch or food, but through the shrieks in the voices and the fiery red eyes of patients infected with hysterical rage. It causes a clouded vision of the world and makes you hate thy neighbour as your enemy.
 
The Museum of Tolerance provides an online version of the book for free : http://motlc.wiesenthal.com/site/pp.asp?c=gvKVLcMVIuG&b=394691
 
The book speaks of how the Bosnian families were protected by their Serbian neighbours from the same village, although it was Serbians who looted them. The Serbian neighbour would stand up for them; they would stop their Serbian soldiers and say, “This one here is a decent family. They don't have weapons. Spare them.” Yet another would say, “Take my life before you touch that child”. Some others may not have stood as upright, but they would smuggle cheese and other food supplies to their Bosnian neighbours. Some would warn them in time so they could go into the woods and stay there for days, till the invaders came, looted their houses and went back. After the houses were devastated, some would at least call them over to their house for a coffee in the afternoon. How many of us can manifest goodness in the face of a threat to our lives ?
 
They also speak of how, in some other cases, the very same Serbian neighbours who were close until the previous day would participate in the loot of the Bosnian house. Some said they had to point guns at their Bosnian Muslim neighbours, because otherwise, their own Serbian clan would kill them. They would make the youth from the Bosnian families work like slaves. The victims mention how they had no clue that the very faces they met across the street every day would land at their door, demanding to chase them out and loot their houses.
 
Until then, they were neighbours who helped each other build their houses. The houses were built by the neighbours lending a hand to each other, except for the roof, which would be given to a professional. The houses that were self-built, as a shared labour between Bosnian and Serbian families, would be destroyed and looted, the doors and windows or whatever was left just taken away by the invaders. Families with children had to move over to Slovenia, leaving all their property back in their village, travelling long distances, even having to bribe for their paperwork to move out. Mosques on the way would arrange some food for the children of the migrating victims. The stories distinctly recall how it was all fine till one day the war started, the news of war arrived in the village and neighbours became archenemies.
 
Which of these two is true human nature ? How does one know which part of Man will manifest when ? I can't help but think of similar stories from the Gujarat Riots of 2002 or the exodus of North-East people from Bangalore in 2012. 
 
As the compiler of the stories admits, the book captures only the view of select Bosnian Muslim victims; there are no stories about Serbian or Croatian victims, who deserve equal mention. But as the epilogue argues, that is not very relevant. "Human suffering due to mutual hatred is universal, and by presenting the suffering of some we are presenting the suffering of all".
 
Sri Ramakrishna tells an interesting story about two brothers fighting for land. They were on either side of the disputed border and were quarrelling at the top of their voices about the patchy border. “It's mine”, one said. “No, it's mine”, yelled the other. Voices grew into arms, arms grew into bruises, bruises grew into attacks and soon they both dropped dead at the border. God, who was watching the fight from above, felt funny. “Well, whose land is this now ?” He asked. There were no owners left to answer.
 
After I read the book, I felt like listening to A R Rahman's song from 1947 Earth : "Ishwar Allah Tere Jahan Pe". It's a beautiful song that captures the questions that would surely have arisen in the minds of those war victims rendered homeless, with their souls shattered and their hopes killed. From the ruins of their houses and ashes of their families, some seeds of hope must have flown across the Slovenian border. They wanted to come back and they wanted to live. But they had to choose between the two. The ghost of war abandoned their villages, and went on to possess some other race, tribe or religion, elsewhere on earth. But they had to struggle, rebuilding their lives and houses in another distant land. This time, without a neighbour to lend a helping hand.
 
Like that song asks: 
 
So many screams, who will hear the voice of love ? 
So many dreams shattered, who will gather the pieces ? 
 
The song is verily a Prayer for Peace. 
 

Tuesday, June 20, 2017

Impressions from Lokmanya Tilak

To read a book on Lokmanya Tilak's life had been on my to-do list for a long time. There were three reasons for this :

1. His work ethic : I had read somewhere that he was almost a workaholic, he worked long hours relentlessly for the society and country, more than people do at their jobs. He was a very versatile learner and worker and in his life, he did all kinds of things. He was a journalist, edited two magazines, taught law classes, was a maths teacher in school and college, a lover of trigonometry, started a school, a college (Fergusson College, Pune), did social work during the plague, researched Vedic History, wrote a commentary on the Gita, he even ran a sugar factory for a while. How could one person be and do all these things ?

2. I had heard that, among the freedom fighters, Swami mentioned Lal-Bal-Pal particularly in praise.

3. Having read and written on Gandhi at my blog, it would be great to acquire a perspective totally different from Gandhi. Tilak and Gandhi mutually respected each other, but during their time, it was clear that their paths were different and they knew it too. Tilak debated this with Gandhi and tried to persuade him to give up non-violence and didn't succeed at it.

I wanted to understand Tilak the man, his personality, his early years and what made him what he was. So I stepped into the university library after 10 years and picked up Tilak's biography by Dhananjay Keer. After the first few chapters, the book turned out to be more of a political chronology: he did this at that Congress meeting, then Congress met next year, then he did that, again Congress met and so on. Nevertheless, I could observe many events in his life and his views, some of which I never had an inkling about. Here are some points and anecdotes that I found interesting and inspiring.

Tilak the social worker :

He was a man who worked amidst the masses. We all know that he used Shivaji's Birth anniversary and Ganesh Chathurthi as platforms to raise the patriotic awareness of the people. He believed in ground action, to be with the people.

During the Poona plague, the British appointed committees to segregate patients. This was to be implemented by British soldiers. There were reports that they were acting too harshly and there were excesses. The British Government said, it was just doing its medical job. Tilak joined the search teams himself, visiting house to house, to ensure the British soldiers didn't commit excesses. He also created awareness in the public about hygiene in slums and urged the Hindus not to stick to old superstitions and stay away from hospital treatment. He started a Hindu hospital where Hindus were treated at their expense. He started a free kitchen in the segregation camp to help the poor.

Tilak believed that it was the duty of the people to see that the government implements laws effectively. He said it was the responsibility of the local leaders, and if they are prosecuted for it, they shouldn't mind suffering imprisonment for the good of the people.

He often criticized the Indian National Congress from within: the same set of elite people meeting from time to time, passing the same set of resolutions about working "along with" the British Government. What was the use ? Get peasants! Help them solve the problems of land revenue, salt, forest and excise under which they are crushed. He told the farmers to pay the govt dues if they had money, but not to do so by contracting debts! He travelled from village to village to gather farmers' support for the struggle.

His was possibly the first agrarian movement in support of independence. The freedom struggle probably was the last truly national movement in India, that involved all sections of society. I couldn't miss out on the comparison with anti-corruption movement in India in 2011. Where were the masses ? The office-going urbanites and the mouse-clicking social media were there, but where were the people who are often at the last receiving end of corruption ? Who travelled from village to village to communicate to them and collect them ? 

Tilak the Leader :

Over time, Tilak reached a point of belief that the leader should do what the people want, but are unable to express or unable to do. He maintained a very strong regional identity, he was highly respected among the Maharashtrian freedom leaders. A Congress session at Pune or Bombay would be unthinkable without Tilak's participation. But he also established a great rapport with like-minded leaders in other far-flung areas, such as V.O.Chidambaram Pillai, Aurobindo etc. It must have been quite difficult to stay within the Congress and fight its lethargy and engage in dialogue with critics in the same conference venue. Yet, he would reach a common ground, if the overall unity of the Indian National Congress or the overall interest of the Nation was paramount. One could have easily expected a strong, independent and fiery mind like Tilak to have broken away into a separate party, out of frustration and impatience, or to be removed for his insistent approach, but neither happened. Other leaders sought him out for his views, even as they knew he may not agree with them.

Tilak refused to plead guilty in the sedition case, although doing so would have reduced his prison sentence. I was curious to know, did he regret it later ? In contrast, a few weeks earlier, Gokhale had given an unconditional apology for speaking up while he was in Britain. Savarkar too was forced to tender an apology and an undertaking to refrain from militant activity. Tilak was no such man. It looked like his diabetes and weak health in prison changed him a bit. Did that, and his age, change his extremist views, did it soften them ? Who can have a peek into how the great minds transition ?

At many points in Tilak's life, he must have faced the conflict of improving the Indian society versus fighting the British, and the conflict of having to support other methods which were different from him. He must have handled the conflict of what is good and bad for the country at that time, or what was a lesser evil in the longer interest. The ethical dilemmas that a leader faces in a real life working for the society are so different from the ones taught in the story-telling classrooms. How does a leader act when all you see around is misery and conflict and there is no one to raise the people's awareness ?

There was once a strange case when Tilak fought a case for the corrupt, while condemning them in Kesari. An English official collected bribes from 17 Mamlatdars for favouring them with promotions. The corruption came to light. The British lured the Mamlatdars, saying that if they confessed who the official was, no action would be taken against them. After the English official was named, the British backtracked. They didn't want to put out an ugly picture that an English official was corrupt, so they changed his offence into some minor stuff and let him go free. They then went after the Mamlatdars and dismissed them for paying bribes. Tilak said the British had gone back on their word; he fought the case for the Mamlatdars, won it and had them reinstated. He then condemned bribery and the Mamlatdars who offered bribes. I guess he must have reasoned: between corruption and the British, fight the bigger evil first!

There was a case where Tilak might have actually concealed the wrong-doer. One Damodar Chapekar, acting on his own, had shot dead a British official who had acted tyrannically during the plague. At dawn next day, he sent a message to Tilak that said, 'The previous night the Ganesha at Ganeshkhind had been propitiated'. Tilak probably knew about some rough plans, he immediately understood the message and exclaimed, 'Is it so? Then be cautious now!'. He later wrote in his paper about both the police raj and also that the culprit should be nabbed and due course of law should be followed. He later said, when asked by the officer : 'I can't help you. Even if I have information, I will never pass it on to you. I believe offender should be punished adequately, but I will never agree to be anybody's spy and never will I betray anyone in the world. But I won't put obstacles in your path. The murder is a blot upon Poona, when found, the offender should be punished as per law.' Later, when Damodar was apprehended, he requested Tilak, who was in jail, to draft his appeal. Tilak did so. Damodar carried Tilak's copy of Gita to the gallows.

Tilak the Man :

He was a man of amazing personal integrity, the kind it's probably impossible to find nowadays in public life. His idealistic approach often put him in conflict with others, but he was a man of strong convictions.

Once, Tilak agreed to be the executor of the will of a dying friend Shri Baba Maharaj. After his death, his relatives falsely accused Tilak and disputed the will. Tilak had to go through the court case for 19 years without compromise and won the case, because he had given a word to his friend!

Once, when a revolutionary sent a diamond as a gift to Tilak from abroad, he ordered it to be sold and the proceeds to be used for the independence struggle.

He wanted members of his institution (Deccan Education Society) to follow a simple spartan lifestyle. He was against regular automatic pay-rises, a fact some of his married colleagues had a problem accepting, because they felt that, according to market conditions, pay has to rise. Also, Tilak said, if members did work outside the job, the remuneration from outside work would belong to the society's common fund. These were very idealistic beliefs, and he himself practised them, but it was impractical for others to follow. He had to, unfortunately, resign from the very college he started, because of extreme views which he considered matters of principle and his colleagues considered highhandedness.

Tilak encouraged his friends and colleagues to have a rest vacation, once a year, in seclusion. Tilak's book 'The Arctic Home of the Vedas' was written during one such, in Singhad after his release from Yeravada Jail. In jail, he had received a flash insight from the Vedic sentence :'The Sun rose after many days', which was an inspiration for the book.

He had a great ability to revert to calm, in the face of danger. When the police surrounded his house for writing seditious articles in Kesari, he quietly surrendered. By the time the court official went to process his bail, (which was denied) and returned, he found Tilak happily snoring in the cell. Haha, cool as a cucumber.

He used his stay in the prison to write a commentary on the Bhagawad Gita. What an intellectual realm our leaders maintained even in the prison those days! Aurobindo had a vision of Lord Krishna, Vinobhaji learnt 4 languages and gave talks on the Gita to fellow prisoners! Tilak favoured an "activism" version of the understanding of the Gita and saw Karma Yoga in the light of patriotism and service to the country.

When he was in prison, other prisoners had great respect for Tilak. The jail authorities sometimes used Tilak's moral authority to tame the rogue prisoners, Tilak's word had the magical respect with them.

There were a few things I found myself disagreeing with while reading the book: his extreme orthodoxy and casteism, his belief that independence should precede social reform and not go hand in hand with it, his extreme pro-Hindu leanings etc. Strangely, he opposed the increase of the minimum age of consent for marriage from 10 to 12 by the British. Publicly, he opposed it saying it was against the tenets of Hinduism. Privately, he agreed that it was okay to raise it. But he opposed it on two grounds: (a) Who are the British to meddle with Indian tradition ? Let Indians decide for Indians. (b) Change in Hinduism has to come from within itself, not be forced from outside. But some of these were probably just a function of his times, while we view them in the modern context of progressiveness.

We know that Tilak met Swami Vivekananda and Shirdi Sai Baba. Let me close with a few more sweet #TIL snippets :

1. Guess who fought the defence of Tilak's Kesari sedition case in 1909 ? Mohammed Ali Jinnah.

2. Max Muller petitioned the British Government for release of Tilak.

3. Much before he appeared on the Indian freedom scene, Gandhi met Tilak, Gokhale and other leaders seeking their support for his South African movement.

4. When Tilak visited Cambridge, he gave a brilliant talk on why Indian students studying there should go back to India after their studies and dedicate themselves to the cause of the nation. Guess who was in that student audience : Subhash Chandra Bose! Now, that should be the biography I pick up next.

 
THANK YOU: These reflections draw sometimes from readers and friends who initiate ideas, build up discussions, post comments and mention interesting links, some online and some over a cup of coffee or during a riverside walk. Thank you.

Disclaimer: Views expressed in this blog are the blogger's personal opinions and made in his individual capacity, sometimes have a story-type approach, mixing facts with imagination and should not be construed as arising from a professional position or a counselling intention.