Data democratization is, as a matter of fact, well on its way. Yet after talking to quite a few practitioners (and practitioners-to-be) in different forums, it turns out that many questions remain open: although some common basis exists, the scope and magnitude are still unclear. Especially with the advent of the Internet of Things (IoT), where machine-generated data is constantly increasing its share, democratization policies should allow the value generated from the data to be maximized. While the need for data democratization is universally accepted and seen as a must-have and a real business driver, a written and published declaration of motives, views and intentions (a.k.a. a manifesto) should provide a common understanding. That is what made us emulate other similar movements, such as Agile Software Development, Agile Data Science or the User Data Manifesto 2.0, and come up with a Data Democratization Manifesto. That is, while there is value in the items on the right, which might have been justified back in the days of a central Business Intelligence setup, we value the items on the left more.

Principles behind this data democratization manifesto

We follow these principles:

- The highest priority is to maximize the value of the data to drive business results.
- Business people and data scientists must mesh, working together on a daily basis all through projects and ensuring that insights are transformed into actions. The duty of the business side is to provide the maximum context and expert knowledge to guide the data scientists' models; the duty of the data side is to incorporate the maximum relevant knowledge from business experts into the models.
- Business and data approaches must share the same targets. Data science shall be measured by the business value obtained from it.
- Every decision encompasses all relevant knowledge available inside and outside the company, as long as data protection policies are not compromised.
- If a data source might be relevant for a decision, it must be made available immediately. It is up to the business data scientists to decide on the relevance of a data source.
- If there is a business need, new data sources need to be made available. The cost of acquiring these data sources should be weighed against the monetary impact of the decisions they support.
- Data has an opportunity window: the sooner it gets to where decisions are made, the higher the value.
- Democratizing data does not mean creating new silos. If new data becomes available in one part of the company, it must be made available to the entire company (always respecting data protection policies).
- Data business and technical catalogue and data lineage practices are essential elements of implementing a data democratization strategy.
- Data-driven decision making is an infinite loop, where knowledge is to be shared via proper documentation and communication.

Final comments and disclaimer

To complement this post, I recommend reading the Royal Statistical Society Data Manifesto, which also handles the data democratization topic, but from the government perspective. While our purpose is to focus on data democratization for enterprises and businesses, it is encouraging to see how well both manifestos align. As a final note, I'd like to say that this is just a proposal, a beta in a working state that encapsulates all aspects we understand under Data Democratization. If you want to contribute to it, please feel free to email us or participate in the forum discussion. Likewise, you can leave your comments right here.
When companies hire a data scientist (setting aside the hype and the "we hire data scientists because everybody else is doing it"), the expectation goes along these lines: "I sit on a pile of data and I want you to generate all the insights I need to steer my company"... But without the proper enabling groundwork, you will probably find the body of the data scientist on the shore of your data lake: the poor soul who inevitably drowned due to irresponsible or nonexistent data management.

The refinery without crude

Using the "data is the new oil" mantra (which I'm starting to hate), it would be similar to: "Somewhere in this area I have a lot of crude oil; do whatever you guys do to create fuel". Well, that is exactly what a refinery does, isn't it? Not quite... a lot of work is required before the distillation process starts in the refinery: a first phase of crude extraction, exploration for quality and volume, creation of a drilling rig, well evaluation and completion... After that, the process of taking the crude to the refinery starts: transportation in oil tankers or pipelines, lightering, etc. Have a look at Adventures in Energy: really well explained and fun to read.

Back in our data world, it is important that companies don't overlook what comes right before the data is distilled into insights, which I call Data Science Enabling. If this enabling work has not yet taken place in your company, your poor newly hired data scientists end up inevitably drowning in your data lake (if you have one) or in one of your siloed data reservoirs. After "drowning", they can either leave the company to join a better refinery, or try their best as data engineers, digging out data from somewhere and distilling it as far as they can, only to come up with insights that are not usable... Without the proper raw material you cannot produce any fuel.
Drowning in a data lake: know the symptoms

I've been leading many data scientist teams over the years, and I consider myself a data scientist too. When you lead a data scientist team, you are accountable for the results and you are supposed to do whatever it takes to put your team members in a position to deliver them (a.k.a. you cannot let them "drown": you make sure the right material reaches the refinery in the right way and at the right pace). But how do you know Data Science Enabling is yet to be done? Your data scientists tell you things like:

I don't know which data sources exist or which ones are relevant for me

- It often happens that you are working with a data source and, well into the process, somebody comes around the corner with a much better data source containing the very information you had started to infer... Knowing upfront which data sources are available saves time and resources and contributes to the quality of the results.

I don't know where the data is available

- I don't know how to access the data: Typical example: you are told there is this wonderful data source in a proprietary database... you ask for access and nobody can help... no API and no services on top... what do you do?
- I don't have the tools (or I can't install them on my company laptop): Accessing corporate data is subject to security and data protection policies (the way it should be). Often, you are given company equipment with a lot of restrictions (no admin rights, you cannot install anything, etc.). Yet you need your tools to start making sense of the data, so you need help navigating this hurdle.
- I don't have the permissions: Obviously, nobody can have access to everything. Yet the process of granting access to a particular cut of the data is often not well documented, technically not possible, or part of a tedious, everlasting request process.
- I don't know whether I'm even supposed to access a given data source: Who can access what is often not well defined. Processes for granting temporary access to a particularly sensitive source (e.g. via Non-Disclosure Agreements) are often accountability orphans. And typically those who want to help can't, and those who can don't care.
- I don't know whether I can copy, modify, persist, etc. the data: Once you get access to a particularly sensitive source, there are most probably guidelines with do's and don'ts, but they are often unclear.

I don't know the meaning of the data fields and their correspondence to business information

- I see a lot of IDs I can't connect to anything: Often I'm handed just the "facts" tables but I miss the dimensions, so I end up with many funnily named IDs I can't do anything with, even if I feel they play an important role.
- I don't know how to aggregate my data into broader entities according to the business standards: Business taxonomies are often not supplied. Often they don't even exist or, worse, several not-quite-compatible versions float around the company. Which one to use becomes a Russian roulette decision.

I don't know if I'm reinventing the wheel

- I don't know if somebody already faced and solved the same problem, or a closely related one: In big corporations it is not rare that, in the middle of a project, you find out that somebody has already solved (or is still trying to solve) the same issue... but you find out by chance; there is no system to check for this information (a.k.a. knowledge management is simply missing).
- I don't know whether somebody who claimed to have solved the same problem actually did: Or even worse, your project gets stopped or challenged because somebody put on a PowerPoint that they have already done that... but when you scratch the surface, there is nothing behind it.
- I don't know the quality (MAPE, MAE, accuracy) of the approach chosen by somebody who solved the same problem: Even when something has been done, quality metrics are often missing, so whether your method is better or worse remains unknown, because the existing solution does not provide any quality metric.

I really have issues understanding the data

- I'm missing the business annotations that explain the data or give feedback on how to solve the problem: Data consumers can enrich the data best, because they have the business context (e.g. during the release weekend we don't see any orders... Is it a tracking issue? Was the weather just that good? Difficult to guess; there was a release, and that is why annotations are a must-have).
- I don't know whom I can talk to about the data, or where the process picture is: Processes encoded as different statuses in the data can't be easily understood. The data scientist can infer the process by identifying combinations of statuses with a timestamp, but this is guesswork to a certain extent. The person responsible for the process can be of much more help.

I don't know whether the data I'm getting access to is kept up to date, or whether it is complete

- I know it is just a sample, but I don't know how the sample was taken: Sometimes data scientists are given a dump of the data; sometimes somebody just took a sample. Data sampling is per se a prolific research area with thousands of papers written every year. The way a data sample is created shapes the entire data science process and most probably impacts the quality of the results.
- I don't know when to expect fresher data: To prevent overfitting or underfitting a model, fresher data can be of great help. Not knowing when new data is going to arrive makes the data science job enormously harder.
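As a minimal illustration of the quality metrics mentioned above, MAPE and MAE take only a few lines to compute and report alongside a solution; this is a generic sketch with made-up numbers, not code from any particular project:

```python
def mae(actual, predicted):
    """Mean Absolute Error: average magnitude of the errors, in the data's units."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mape(actual, predicted):
    """Mean Absolute Percentage Error: errors relative to the actual values.
    Note: undefined when an actual value is zero."""
    return 100 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)

# Toy example: three forecasts versus what actually happened.
actual = [100, 200, 400]
predicted = [110, 190, 380]
print(f"MAE:  {mae(actual, predicted):.2f}")
print(f"MAPE: {mape(actual, predicted):.2f}%")
```

Publishing even two numbers like these next to an existing solution is enough for the next data scientist to know whether their method is better or worse.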
I don't know whether the data is consistent along the timeline

- I don't know whether any data-correction algorithm has been applied: It is not unusual for the data to present gaps. Sometimes somebody corrects these gaps, but the remedy can be worse than the problem if not done properly (e.g. interpolating the sales of a bank holiday). Not knowing this might lead the data scientists to draw wrong conclusions.
- I don't know the reason why there are gaps in the data (if any): A release, a bank holiday, a system outage, a change in the logs, or just something that looks like a gap but is not one... Without proper documentation it is difficult to know how to deal with the gaps.
- I don't have an indicator of the completeness of the data: This is similar to the sampling problem, only forcibly introduced by the measuring system. Let's say you are analyzing an incident log and the application registers only 75% of the incidents... Knowing that would be precious if you are tasked with creating an early-warning system, don't you think?

I don't know whether the data is consistent with other data sources

- I don't know in which other data sources the same information is available: Let's say your model is based on one particular data source which is slightly inconsistent with another one (you didn't even know it existed)... Your model is going to conflict with the findings of other parts of the organization, and your life in the company turns into hell.
- I don't know whether different users have the same access to a data source or each one has their own copy (in which case I don't know if the copies are 100% aligned): This is another aspect: local copies that are hardly ever updated, or slightly modified, snapshots you need to identify, because your local copy might become completely inconsistent over time.
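To make the bank-holiday interpolation pitfall concrete, here is a small pandas sketch (the dates and the holiday calendar are invented for illustration) contrasting a blind correction with a documented one that flags the imputed value instead of hiding it:

```python
import pandas as pd

# Daily sales with a missing value on a (hypothetical) bank holiday.
sales = pd.Series(
    [120.0, 130.0, None, 125.0],
    index=pd.to_datetime(["2016-05-04", "2016-05-05", "2016-05-06", "2016-05-07"]),
)
holidays = {pd.Timestamp("2016-05-06")}  # made-up holiday calendar

# Blind linear interpolation invents sales on a day the shop was closed.
blind = sales.interpolate()

# Documented correction: fill holiday gaps with 0 and keep a flag column,
# so downstream consumers know the value was imputed, not measured.
corrected = pd.DataFrame({"sales": sales})
corrected["is_holiday"] = [ts in holidays for ts in corrected.index]
corrected.loc[corrected["is_holiday"], "sales"] = (
    corrected.loc[corrected["is_holiday"], "sales"].fillna(0.0)
)
```

The blind series reports healthy sales for the holiday; the documented one lets a data scientist decide for themselves how to treat the gap.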
I don't know on which platform I can run my analysis or publish the results

- I don't know whether my laptop can process so much information: We all know that data is getting big... yet apparently companies sometimes think that most tasks can be done on your own PC. Certainly many of them can, but not all!
- When my analysis is ready, I don't know where to deploy it within the context of a data product: Nowadays, good data scientists don't just provide the results of their analysis; they go beyond that and create data products. A data product is a piece of software which needs to be hosted on a platform and fed with fresh data.
- I don't have an environment with the tooling and best-practice processes of standard software development: Software development best practices also apply to data products: version control, repositories, continuous integration, testing, etc. Probably none of these components have been made available for the data scientists to work properly.

What can be done before your data scientists start to sink?

Basically, the symptoms we just discussed are the sign of bad or nonexistent data quality management. Companies with serious aspirations in the data science domain should take data management very seriously. The role of a Data Quality Manager needs to be understood, well staffed and empowered. Before a data scientist is thrown into the middle of the data ocean, some criteria must be fulfilled. In the picture below, I provide what I think is a quite useful checklist, where the most relevant Data Readiness criteria are listed, waiting to get a tick in their boxes. Sticking to this list can literally be a life-saver, and the guarantee that you make the most of your data!

Data Readiness Checklist

I'd even go beyond that and establish Data Readiness as a conditio sine qua non (a.k.a. a strong prerequisite) for a Data Lake.
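The symptom families above can be condensed into a machine-checkable readiness record per data source. The field names below are my own shorthand derived from the symptoms discussed in this post, not the actual checklist in the picture:

```python
from dataclasses import dataclass, fields

@dataclass
class DataReadiness:
    """One readiness record per data source; each flag mirrors a symptom family."""
    cataloged: bool              # source is discoverable in a catalogue
    access_documented: bool      # how to get access and permissions is written down
    fields_documented: bool      # business meaning of fields and taxonomies supplied
    prior_work_searchable: bool  # knowledge management: past solutions findable
    annotated: bool              # business annotations (releases, holidays) attached
    freshness_known: bool        # update cadence and sampling method documented
    consistency_checked: bool    # gaps, corrections, cross-source checks documented
    platform_available: bool     # compute, deployment and dev tooling in place

    def is_ready(self) -> bool:
        # Ready only when every box is ticked.
        return all(getattr(self, f.name) for f in fields(self))

source = DataReadiness(True, True, True, False, True, True, True, True)
print(source.is_ready())  # one unticked box -> False
```

Keeping such records next to the catalogue makes "is this lake usable?" an answerable question instead of a debate.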
In other words, stop calling it a Data Lake if your data is not usable... or start calling your data scientists "magicians" or "wizards":

"This is the best definition of a #datascientist you can get if you don't have the proper #datamanagement in place!" pic.twitter.com/KKcZ2g6LYk — Juan Bernabe (@_juan_bernabe) May 7, 2016