Munging and liking it: wrangling patient time-related data with IPython
Niki Kunene Eastern Connecticut State University
Abstract The uptake of machine learning techniques in healthcare has lagged behind industry for multiple reasons, including an underdeveloped IT infrastructure. The complexity and variability of healthcare data, and the pivotal role that time plays in that data, are among the other cited reasons for the slower uptake. In this paper we examine the use of Python/Pandas to wrangle a unique healthcare monitoring dataset with several time-related variables, in preparation for machine learning and time series analysis. In knowledge discovery and machine learning projects, data wrangling or data munging is the process of data understanding and preparation in which data is identified, extracted, cleaned and integrated.
Data scientists reportedly spend 50-80% of their time preparing and managing data for analysis; only a fraction of an expert’s time is spent on value-added exploration. Although efforts towards automation using artificial intelligence are growing, most of this so-called ‘data janitor work’ is still performed largely by hand. Some companies have used crowdsourcing to address the need. Because contexts vary and datasets are unique, a generalized automated or artificial intelligence solution still eludes us. There are commercial tools that support data wrangling, with Trifacta a market leader. Among more readily accessible and cost-effective options, the omnipresent Microsoft Excel is widely used by practitioners in some form, perhaps in conjunction with a text or source code editor such as Notepad++. Free data wrangling tools include DataWrangler, Tabula, OpenRefine, R packages, CSVKit, and Python/Pandas.
For large projects, teams may automate some of the repeating tasks using tools at their disposal. For small projects or unique datasets, the tools used for data wrangling need to offer efficiency, flexibility, ease of use, and ease of documentation or annotation. We believe instructional needs share commonality with the needs of small projects in this instance.
We examine the use of Python/Pandas to munge a unique healthcare monitoring dataset from an ICU facility. The data includes several time-related variables; the wrangling is done in preparation for machine learning and time series analysis. Specifically, we examine the use of IPython, the kernel of Jupyter Notebook. We evaluate the tool with respect to criteria from the literature: flexibility, efficiency, ease of documentation, error tolerance and feedback. We found that the tool offers high levels of flexibility for formatting and recoding raw data; efficiency, particularly for repeating by-patient tasks for time series analysis; and ease of documentation, via comments placed alongside the executable code. We found some error tolerance, particularly for errors related to versioning issues, where the tool offers suggestions for future-proofing code. The degree of feedback is also impressive, allowing even novice users to pinpoint precisely where their code fell apart. We found Jupyter Notebook easy to learn; however, one must also learn Python and Pandas dataframes.
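To make the kind of by-patient, time-related wrangling described above concrete, the following is a minimal sketch in Pandas. It is not the code used in the study. The records, the column names (`patient_id`, `charted_at`, `heart_rate`) and the helper functions `prepare` and `hourly_mean` are all hypothetical, invented only for illustration. The sketch recodes raw text fields into datetime and numeric types, computes elapsed time per patient, and aggregates readings into hourly bins per patient.

```python
import pandas as pd

# Hypothetical monitoring records; names and values are invented for illustration.
raw = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2],
    "charted_at": ["2019-01-01 08:00", "2019-01-01 08:30", "2019-01-01 09:15",
                   "2019-01-01 08:10", "2019-01-01 10:40"],
    "heart_rate": ["88", "92", "n/a", "75", "79"],
})


def prepare(df):
    """Recode raw text columns and add minutes elapsed per patient."""
    out = df.copy()
    # Parse timestamp strings into datetime64 values.
    out["charted_at"] = pd.to_datetime(out["charted_at"])
    # Convert readings to numbers; unparseable entries such as "n/a" become NaN.
    out["heart_rate"] = pd.to_numeric(out["heart_rate"], errors="coerce")
    out = out.sort_values(["patient_id", "charted_at"])
    # Minutes since each patient's first recorded observation.
    first_seen = out.groupby("patient_id")["charted_at"].transform("min")
    out["minutes_since_first"] = (out["charted_at"] - first_seen).dt.total_seconds() / 60
    return out


def hourly_mean(df):
    """Mean heart rate per patient per 60-minute bin."""
    bins = pd.Grouper(key="charted_at", freq="60min")
    return df.groupby(["patient_id", bins])["heart_rate"].mean()


tidy = prepare(raw)
hourly = hourly_mean(tidy)
print(tidy)
print(hourly)
```

Grouping on `patient_id` lets the same step run once for every patient without an explicit loop, which is the sort of repeated by-patient task the efficiency criterion refers to. The `#` comments alongside each step illustrate the in-line documentation the evaluation mentions.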
Recommended Citation: Kunene, N. (2019). Munging and liking it: wrangling patient time-related data with IPython. Proceedings of the Conference on Information Systems Applied Research, v.12 n.5255, Cleveland, Ohio