Big data analytics is turning out to be one of the toughest undertakings in recent memory for the healthcare industry.
Providers who have barely come to grips with putting data into their electronic health records (EHR) are now being asked to pull actionable insights out of them – and apply those learnings to complicated initiatives that directly impact their reimbursement rates.
For healthcare organizations that successfully integrate data-driven insights into their clinical and operational processes, the rewards can be huge.
Healthier patients, lower care costs, more visibility into performance, and higher staff and consumer satisfaction rates are among the many benefits of turning data assets into data insights.
The road to meaningful healthcare analytics is a rocky one, however, filled with challenges and problems to solve.
By its very nature, big data is complex and unwieldy, requiring provider organizations to take a close look at their approaches to collecting, storing, analyzing, and presenting their data to staff members, business partners, and patients.
What are some of the top challenges organizations typically face when booting up a big data analytics program, and how can they overcome these issues to achieve their data-driven clinical and financial goals?
Capture
All data comes from somewhere, but unfortunately for many healthcare providers, it doesn’t always come from somewhere with impeccable data governance habits. Capturing data that is clean, complete, accurate, and formatted correctly for use in multiple systems is an ongoing battle for organizations, many of which aren’t on the winning side of the conflict.
In one recent study at an ophthalmology clinic, EHR data matched patient-reported data in just 23.5 percent of records. When patients reported having three or more eye health symptoms, their EHR data did not agree at all.
Poor EHR usability, convoluted workflows, and an incomplete understanding of why big data is important to capture well can all contribute to quality issues that will plague data throughout its lifecycle.
Providers can start to improve their data capture routines by prioritizing valuable data types for their specific projects, enlisting the data governance and integrity expertise of health information management professionals, and developing clinical documentation improvement programs that coach clinicians about how to ensure that data is useful for downstream analytics.
Cleaning
Healthcare providers are intimately familiar with the importance of cleanliness in the clinic and the operating room, but may not be quite as aware of how vital it is to cleanse their data, too.
Dirty data can quickly derail a big data analytics project, especially when bringing together disparate data sources that may record clinical or operational elements in slightly different formats. Data cleaning – also known as cleansing or scrubbing – ensures that datasets are accurate, correct, consistent, relevant, and not corrupted in any way.
While most data cleaning processes are still performed manually, some IT vendors do offer automated scrubbing tools that use logic rules to compare, contrast, and correct large datasets. These tools are likely to become increasingly sophisticated and precise as machine learning techniques continue their rapid advance, reducing the time and expense required to ensure high levels of accuracy and integrity in healthcare data warehouses.
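To make the idea concrete, here is a minimal Python sketch of the kind of rule-based scrubbing such tools automate; the file and column names are hypothetical.

```python
import pandas as pd

# Hypothetical patient extract; file and column names are illustrative only.
df = pd.read_csv("patient_extract.csv")

# Normalize formatting differences between source systems.
df["last_name"] = df["last_name"].str.strip().str.title()
df["dob"] = pd.to_datetime(df["dob"], errors="coerce")  # invalid dates become NaT

# Apply a simple logic rule: a date of birth cannot be in the future.
df.loc[df["dob"] > pd.Timestamp.today(), "dob"] = pd.NaT

# Remove exact duplicates introduced by repeated interface feeds.
df = df.drop_duplicates(subset=["mrn", "encounter_id"])
```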
Storage
Front-line clinicians rarely think about where their data is being stored, but it’s a critical cost, security, and performance issue for the IT department. As the volume of healthcare data grows exponentially, some providers are no longer able to manage the costs and impacts of on-premises data centers.
While many organizations are most comfortable with on-premises data storage, which promises control over security, access, and up-time, an on-site server network can be expensive to scale, difficult to maintain, and prone to producing data siloes across different departments.
Cloud storage is becoming an increasingly popular option as costs drop and reliability grows. Close to 90 percent of healthcare organizations are using some sort of cloud-based health IT infrastructure, including storage and applications, according to a 2016 survey.
The cloud offers nimble disaster recovery, lower up-front costs, and easier expansion – although organizations must be extremely careful about choosing partners that understand the importance of HIPAA and other healthcare-specific compliance and security issues.
Many organizations end up with a hybrid approach to their data storage programs, which may be the most flexible and workable approach for providers with varying data access and storage needs. When developing hybrid infrastructure, however, providers should be careful to ensure that disparate systems are able to communicate and share data with other segments of the organization when necessary.
Security
Data security is the number one priority for healthcare organizations, especially in the wake of a rapid-fire series of high-profile breaches, hacks, and ransomware episodes. From phishing attacks to malware to laptops accidentally left in a cab, healthcare data is subject to a nearly infinite array of vulnerabilities.
The HIPAA Security Rule includes a long list of technical safeguards for organizations storing protected health information (PHI), including transmission security, authentication protocols, and controls over access, integrity, and auditing.
In practice, these safeguards translate into common-sense security procedures such as using up-to-date anti-virus software, setting up firewalls, encrypting sensitive data, and using multi-factor authentication.
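As one illustration of the encryption piece, here is a minimal sketch using the third-party Python cryptography library; this is just one of many suitable tools, and in a real deployment the key would live in a key-management service, not in code.

```python
from cryptography.fernet import Fernet

# Illustration only: real deployments keep keys in a key-management service.
key = Fernet.generate_key()
cipher = Fernet(key)

# Encrypt a sensitive field before it is written to disk or transmitted.
token = cipher.encrypt(b"patient SSN: 000-00-0000")

# Only holders of the key can recover the plaintext.
plaintext = cipher.decrypt(token)
```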
But even the most tightly secured data center can be taken down by the fallibility of human staff members, who tend to prioritize convenience over lengthy software updates and complicated constraints on their access to data or software.
Healthcare organizations must frequently remind their staff members of the critical nature of data security protocols and consistently review who has access to high-value data assets to prevent malicious parties from causing damage.
Stewardship
Healthcare data, especially on the clinical side, has a long shelf life. In addition to being required to keep patient data accessible for at least six years, providers may wish to utilize de-identified datasets for research projects, which makes ongoing stewardship and curation an important concern. Data may also be reused or reexamined for other purposes, such as quality measurement or performance benchmarking.
Understanding when the data was created, by whom, and for what purpose – as well as who has previously used the data, why, how, and when – is important for researchers and data analysts.
Developing complete, accurate, and up-to-date metadata is a key component of a successful data governance plan. Metadata allows analysts to exactly replicate previous queries, which is vital for scientific studies and accurate benchmarking, and prevents the creation of “data dumpsters,” or isolated datasets that are limited in their usefulness.
Healthcare organizations should assign a data steward to handle the development and curation of meaningful metadata. A data steward can ensure that all elements have standard definitions and formats, are documented appropriately from creation to deletion, and remain useful for the tasks at hand.
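The fields a steward tracks will vary by organization, but a minimal sketch in Python might look like the following; every field name here is an assumption for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class DatasetMetadata:
    """Illustrative metadata record a data steward might maintain."""
    name: str
    created_by: str
    created_at: datetime
    purpose: str
    source_system: str
    access_log: list = field(default_factory=list)  # who used it, when, and why

    def record_use(self, user: str, reason: str) -> None:
        self.access_log.append((user, reason, datetime.now()))

meta = DatasetMetadata(
    name="diabetes_cohort",
    created_by="HIM team",
    created_at=datetime(2016, 3, 1),
    purpose="quality measurement",
    source_system="EHR warehouse",
)
meta.record_use("analyst_01", "replicate a prior benchmarking query")
```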
Querying
Robust metadata and strong stewardship protocols also make it easier for organizations to query their data and get the answers that they are expecting. The ability to query data is foundational for reporting and analytics, but healthcare organizations must typically overcome a number of challenges before they can engage in meaningful analysis of their big data assets.
Firstly, they must overcome data siloes and interoperability problems that prevent query tools from accessing the organization’s entire repository of information. If different components of a dataset are held in multiple walled-off systems or in different formats, it may not be possible to generate a complete portrait of an organization’s status or an individual patient’s health.
And even if data is held in a common warehouse, standardization and quality can be lacking. In the absence of medical coding systems like ICD-10, SNOMED CT, or LOINC that reduce free-form concepts into a shared ontology, it may be difficult to ensure that a query is identifying and returning the correct information to the user.
Many organizations use Structured Query Language (SQL) to dive into large datasets and relational databases, but it is only effective when a user can first trust the accuracy, completeness, and standardization of the data at hand.
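A brief sketch of what such a query might look like, run here through Python's built-in sqlite3 module; the table and column names are hypothetical, and the query assumes diagnoses have already been standardized to ICD-10 codes.

```python
import sqlite3

conn = sqlite3.connect("warehouse.db")  # hypothetical warehouse extract

# Find patients with three or more type 2 diabetes encounters (ICD-10 E11*).
rows = conn.execute(
    """
    SELECT p.patient_id, COUNT(e.encounter_id) AS encounters
    FROM patients p
    JOIN encounters e ON e.patient_id = p.patient_id
    WHERE e.icd10_code LIKE 'E11%'
    GROUP BY p.patient_id
    HAVING COUNT(e.encounter_id) >= 3
    """
).fetchall()
```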
Reporting
After providers have nailed down the query process, they must generate a report that is clear, concise, and accessible to the target audience.
Once again, the accuracy and integrity of the data has a critical downstream impact on the accuracy and reliability of the report. Poor data at the outset will produce suspect reports at the end of the process, which can be detrimental for clinicians who are trying to use the information to treat patients.
Providers must also understand the difference between “analysis” and “reporting.” Reporting is often the prerequisite for analysis – the data must be extracted before it can be examined – but reporting can also stand on its own as an end product.
While some reports may be geared towards highlighting a certain trend, coming to a novel conclusion, or convincing the reader to take a specific action, others must be presented in a way that allows the reader to draw his or her own inferences about what the full spectrum of data means.
Organizations should be very clear about how they plan to use their reports to ensure that database administrators can generate the information they actually need.
A great deal of the reporting in the healthcare industry is external, since regulatory and quality assessment programs frequently demand large volumes of data to feed quality measures and reimbursement models. Providers have a number of options for meeting these various requirements, including qualified registries, reporting tools built into their electronic health records, and web portals hosted by CMS and other groups.
Visualization
At the point of care, a clean and engaging data visualization can make it much easier for a clinician to absorb information and use it appropriately.
Color-coding is a popular data visualization technique that typically produces an immediate response – for example, red, yellow, and green are universally understood to mean stop, caution, and go.
Organizations must also consider good data presentation practices, such as charts that use proper proportions to illustrate contrasting figures, and correct labeling of information to reduce potential confusion. Convoluted flowcharts, cramped or overlapping text, and low-quality graphics can frustrate and annoy recipients, leading them to ignore or misinterpret data.
Common examples of data visualizations include heat maps, bar charts, pie charts, scatterplots, and histograms, all of which have their own specific uses to illustrate concepts and information.
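As a small illustration of color-coding in practice, the following Python sketch uses matplotlib with invented readmission figures and thresholds.

```python
import matplotlib.pyplot as plt

# Invented 30-day readmission rates by unit, color-coded by threshold.
units = ["Cardiology", "Oncology", "Orthopedics", "Medicine"]
rates = [0.18, 0.12, 0.07, 0.15]
colors = ["red" if r > 0.15 else "gold" if r > 0.10 else "green" for r in rates]

plt.bar(units, rates, color=colors)
plt.ylabel("30-day readmission rate")
plt.title("Readmissions by unit (illustrative data)")
plt.show()
```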
Updating
Healthcare data is not static, and most elements will require relatively frequent updates in order to remain current and relevant. For some datasets, like patient vital signs, these updates may occur every few seconds. Other information, such as a home address or marital status, might only change a few times during an individual’s entire lifetime.
Understanding the volatility of big data, or how often and to what degree it changes, can be a challenge for organizations that do not consistently monitor their data assets.
Providers must have a clear idea of which datasets need manual updating, which can be automated, how to complete this process without downtime for end-users, and how to ensure that updates can be conducted without damaging the quality or integrity of the dataset.
Organizations should also ensure that they are not creating unnecessary duplicate records when attempting an update to a single element, which may make it difficult for clinicians to access necessary information for patient decision-making.
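One common safeguard is to key updates on a stable identifier so that a changed element modifies the existing record rather than spawning a duplicate. A minimal sketch with Python's sqlite3 follows; the upsert syntax requires SQLite 3.24 or newer, and the table and field names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect("warehouse.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS patients (mrn TEXT PRIMARY KEY, address TEXT)"
)

# Keyed on the medical record number: a new address updates the existing
# row instead of creating a duplicate patient record.
conn.execute(
    """
    INSERT INTO patients (mrn, address) VALUES (?, ?)
    ON CONFLICT(mrn) DO UPDATE SET address = excluded.address
    """,
    ("12345", "42 New Street"),
)
conn.commit()
```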
Sharing
Few providers operate in a vacuum, and fewer patients receive all of their care at a single location. This means that sharing data with external partners is essential, especially as the industry moves towards population health management and value-based care.
Data interoperability is a perennial concern for organizations of all types, sizes, and positions along the data maturity spectrum.
Fundamental differences in the way electronic health records are designed and implemented can severely curtail the ability to move data between disparate organizations, often leaving clinicians without information they need to make key decisions, follow up with patients, and develop strategies to improve overall outcomes.
The industry is currently working hard to improve the sharing of data across technical and organizational barriers. Emerging tools and strategies such as FHIR and public APIs, as well as partnerships like CommonWell and Carequality, are helping developers share data easily and securely.
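To give a flavor of what FHIR-style exchange looks like, here is a hedged Python sketch that reads a Patient resource over FHIR's REST interface; the server URL and patient ID are hypothetical.

```python
import requests

base = "https://fhir.example.org"  # hypothetical FHIR server

# FHIR exposes resources over plain REST: GET [base]/Patient/[id].
resp = requests.get(
    f"{base}/Patient/123",
    headers={"Accept": "application/fhir+json"},
)
patient = resp.json()
print(patient.get("gender"), patient.get("birthDate"))
```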
But adoption of these methodologies has not yet hit the tipping point, leaving many organizations cut off from the possibilities inherent in the seamless sharing of patient data.
In order to develop a big data exchange ecosystem that connects all members of the care continuum with trustworthy, timely, and meaningful information, providers will need to overcome every challenge on this list. Doing so will take time, commitment, funding, and communication – but success will ease the burdens of all those concerns.
Many newcomers to data science spend a significant amount of time on theory and not enough on practical application. To make real progress along the path toward becoming a data scientist, it’s important to start building data science projects as soon as possible.
If you’re thinking about putting together your own data science projects and don’t know where to begin, it’s a good idea to seek inspiration from others. At Springboard, we offer mentored bootcamps that culminate in capstone projects focused on solving a real-world problem using the skills acquired throughout the course.
In this post, we’ll share data science project examples from both Springboard students and outside data scientists that will help you understand what a completed project should look like. We’ll also provide some tips for creating your own interesting data science projects.
Data Science Projects
“Eat, Rate, Love” — An Exploration of R, Yelp, and the Search for Good Indian Food (Beginner)
When it comes time to choose a restaurant, many people turn to Yelp to determine which is the best option for the type of food they’re in search of. But what happens if you’re looking for a specific type of cuisine and there are many restaurants rated the same within a small radius? Which one do you choose? Robert Chen took Springboard’s Introduction to Data Science course and chose for his capstone project a deeper evaluation of Yelp reviewers, to determine whether their reviews pointed to the best Indian restaurants.
Chen discovered while searching Yelp that there were many recommended Indian restaurants with close to the same scores. Certainly not all the reviewers had the same knowledge of this cuisine, right? With this in mind, he factored reviewer characteristics, such as whether a reviewer had an Indian name, into his analysis.
His modification to the data and the variables showed that those with Indian names tended to give good reviews to only one restaurant per city out of the 11 cities he analyzed, thus providing a clear choice per city for restaurant patrons.
Yelp’s data has become popular among newcomers to data science. You can access it here. Find out more about Robert’s project here.
Third and Goal (Intermediate)
The intersection of sports and data is full of opportunities for aspiring data scientists. A lover of both, Divya Parmar decided to focus on the NFL for his capstone project during Springboard’s Introduction to Data Science course.
Divya’s goal: to determine the efficiency of various offensive plays in different tactical situations. Here’s a sample from Divya’s project write-up:
To investigate 3rd down behavior, I obtained play-by-play data from Armchair Analysis; the dataset was every play from the first eight weeks of this NFL season. Since the dataset was clean, and we know that 80 percent of the data analysis process is cleaning, I was able to focus on the essential data manipulation to create the data frames and graphs for my analysis. I used R as my programming language of choice for analysis, as it is open source and has thousands of libraries that allow for vast functionality.
I loaded my csv file into RStudio (my software for the analysis). First, I wanted to look at offensive drives themselves, so I generated a drive number for each drive and attached it to the individual plays dataset. With that, I could see the length of each drive based on the count of each drive number.
Then, I moved on to my main analysis of 3rd down plays. I created a new data frame, which included only 3rd down plays that were a run or a pass (excluding field goals, penalties, etc.). I added a new categorical column named “Distance,” which signified how many yards a team had to go to convert the first down. Using conventional NFL definitions, I decided on a set of distance cut-offs.
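Divya’s exact cut-offs aren’t reproduced above, and he worked in R; as a sketch of that kind of bucketing, the pandas example below (kept in Python for consistency with the other sketches here) uses hypothetical thresholds.

```python
import pandas as pd

plays = pd.DataFrame({"yards_to_go": [1, 4, 9, 2, 12]})  # toy data

# Hypothetical distance buckets; Divya's actual definitions may differ.
plays["Distance"] = pd.cut(
    plays["yards_to_go"],
    bins=[0, 3, 7, 99],
    labels=["Short", "Medium", "Long"],
)
```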
This hands-on project work was the most challenging part of the course for Divya, he said, but it allowed him to practice the different steps in the data science process: assessing the problem, manipulating the data, and delivering actionable insights to stakeholders.
You can access the data set Divya used here.
Who’s a Good Dog? (Intermediate)
Garrick Chu, another Springboard alum, chose to work on an image classification project, identifying dog breeds using neural networks. This project primarily leveraged Keras through Jupyter notebooks and tested a wide variety of skills commonly associated with neural networks and image data.
One of Garrick’s goals was to determine whether he could build a model that would be better than humans at identifying a dog’s breed from an image. Because this was a learning task with no benchmark for human accuracy, once Garrick optimized the network to his satisfaction, he went on to conduct original survey research in order to make a meaningful comparison.
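Garrick’s actual architecture isn’t shown here, but a minimal transfer-learning sketch in Keras, the library his project leveraged, might look like the following; the pretrained backbone and the 120-class output layer are assumptions.

```python
from tensorflow import keras

# Assumed setup: a pretrained backbone fine-tuned for 120 dog breeds.
base = keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet"
)
base.trainable = False  # keep the pretrained convolutional features frozen

model = keras.Sequential([
    base,
    keras.layers.GlobalAveragePooling2D(),
    keras.layers.Dense(120, activation="softmax"),  # one unit per breed
])
model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
```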
See more of Garrick’s work here. You can access the data set he used here.
Amazon vs. eBay (Advanced)
Ever pulled the trigger on a purchase only to discover shortly afterward that the item was significantly cheaper at another outlet?
In support of a Chrome extension he was building, Chase Roberts decided to compare the prices of 3,500 products on eBay and Amazon. With his biases acknowledged, Chase walks readers of this blog post through his project, starting with how he gathered the data and documenting the challenges he faced during this process.
The results showed potential for substantial savings: “Our shopping cart has 3,520 unique items and if you chose the wrong platform to buy each of these items (by always shopping at whichever site has a more expensive price), this cart would cost you $193,498.45. Or you could pay off your mortgage. This is the worst case scenario for our shopping cart. The best case scenario for our shopping cart, assuming you found the lowest price between eBay and Amazon on every item, is $149,650.94. This is a $44,000 difference, or 23%!”
Find out more about the project here.
Fake News! (Advanced)
These days, it’s hard enough for the average social media user to determine when an article is made up with an intention to deceive. So is it possible to build a model that can discern whether a news piece is credible? That’s the question a four-person team from the University of California at Berkeley attempted to answer with this project.
First, the team identified two common forms of fake news to focus on: clickbait (“shocking headlines meant to generate clicks to increase ad revenue”) and propaganda (“intentionally misleading or deceptive articles meant to promote the author’s agenda”).
To develop a classifier that would be able to detect clickbait and propaganda articles, the foursome scraped data from news sources listed on OpenSources, preprocessed articles for content-based classification using natural language processing, trained different machine learning models to classify the news articles, and created a web application to serve as the front end for their classifier.
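The team’s exact models aren’t listed in this summary; as a hedged sketch, a simple content-based pipeline in scikit-learn could look like this, with toy articles standing in for the scraped OpenSources data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-ins for the scraped OpenSources articles.
texts = [
    "You won't BELIEVE what happened next!",
    "The committee released its quarterly budget report.",
]
labels = ["clickbait", "credible"]

# Content-based classification: TF-IDF features plus a linear model.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(texts, labels)
print(clf.predict(["Shocking trick doctors don't want you to know"]))
```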
Find out more and try it out here.
Audio Snowflake (Advanced)
When you think about data science projects, chances are you think about how to solve a particular problem, as seen in the examples above. But what about creating a project for the sheer beauty of the data? That’s exactly what Wendy Dherin did.
The purpose of her Hackbright Academy project was to create a stunning visual representation of music as it played, capturing a number of components, such as tempo, duration, key, and mood. The web application Wendy created uses an embedded Spotify web player, an API to scrape detailed song data, and trigonometry to move a series of colorful shapes around the screen. Audio Snowflake maps both quantitative and qualitative characteristics of songs to visual traits such as color, saturation, rotation speed, and the shapes of figures it generates.
She explains a bit about how it works:
Each line forms a geometric shape called a hypotrochoid (pronounced hai-po-tro-koid).
Hypotrochoids are mathematical roulettes traced by a point P that is attached to a circle rolling around the interior of a larger circle. If you have played with Spirograph, you may be familiar with the concept.
The shape of any hypotrochoid is determined by the radius a of the large circle, the radius b of the small circle, and the distance h between the center of the smaller circle and point P.
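For reference, the standard parametric equations of a hypotrochoid in terms of a, b, and h are:

```latex
x(t) = (a - b)\cos t + h \cos\!\left(\frac{a - b}{b}\, t\right)
y(t) = (a - b)\sin t - h \sin\!\left(\frac{a - b}{b}\, t\right)
```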
For Audio Snowflake, these values are derived from characteristics of the song being played.
Find out more here.
Bonus Data Sets for Data Science Projects
Here are a few more data sets to consider as you ponder data science project ideas, including the Big Mart sales data referenced in the tips below.
You can also find a wide range of free public data sets in this blog post.
Tips for Creating Cool Data Science Projects
Getting started on your own data science project may seem daunting at first, which is why at Springboard, we pair students with one-on-one mentors and student advisors who help guide them through the process.
When you start your data science project, you need to come up with a problem that you can use data to help solve. It could be a simple problem or a complex one, depending on how much data you have, how many variables you must consider, and how complicated the programming is.
Choose the Right Problem
If you’re a data science beginner, it’s best to consider problems that have limited data and variables. Otherwise, your project may get too complex too quickly, potentially deterring you from moving forward. Choose one of the data sets in this post, or look for something in real life that has a limited data set. Data wrangling can be tedious work, so it’s key, especially when starting out, to make sure the data you’re manipulating and the larger topic are interesting to you. These projects are often challenging, but they should be fun!
Break Up the Project Into Manageable Pieces
Your next task is to outline the steps you’ll need to take to create your data science project. Once you have your outline, you can tackle the problem and come up with a model that may prove your hypothesis. You can do this in six steps:
Generate Your Hypotheses
After you have your problem, you need to create at least one hypothesis that will help solve the problem. The hypothesis is your belief about how the data reacts to certain variables. For example, if you are working with the Big Mart data set that we included among the bonus options above, you may make the hypothesis that stores located in affluent neighborhoods are more likely to see higher sales of expensive coffee than those stores in less affluent neighborhoods.
This is, of course, dependent on you obtaining general demographics of certain neighborhoods. You will need to create as many hypotheses as it takes to solve the problem.
Study the Data
Your hypotheses need to have data that will allow you to prove or disprove them. This is where you need to look in the data set for variables that affect the problem. In the Big Mart example, you’ll be looking for data that can be turned into those variables. For the coffee hypothesis, you need to be able to identify brands of coffee, prices, sales, and the surrounding neighborhood demographics of each store. If you do not have the data, you either have to dig deeper or change your hypothesis.
Clean the Data
As much as data scientists prefer to have clean, ready-to-go data, the reality is seldom neat or orderly. You may have outlier data that you can’t readily explain, like a sudden large, one-time purchase of expensive coffee in a store that is in a lower income neighborhood or a dip in coffee purchases that you wouldn’t expect during a random two-week period (using the Big Mart scenario). Or maybe one store didn’t report data for a week.
These are all anomalies, data points that don’t fit the norm. In these cases, it’s up to you as a data scientist to remove those outliers and fill in missing data so that the data is more or less consistent. Without these changes, your results will be skewed and the outlier data will affect the results, sometimes drastically.
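A minimal pandas sketch of that kind of cleanup, using an invented weekly coffee-sales series; the median-based rule and threshold are simple illustrative choices.

```python
import pandas as pd

# Invented weekly coffee sales with one extreme outlier and a missing week.
sales = pd.Series([220.0, 235.0, 9800.0, None, 228.0, 241.0])

# Flag points far from the median (robust to the outlier itself),
# then interpolate the gaps that remain.
median = sales.median()
mad = (sales - median).abs().median()
cleaned = sales.mask((sales - median).abs() > 10 * mad)
cleaned = cleaned.interpolate()
```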
With the problem you’re trying to solve, you aren’t looking for exceptions, but rather you’re looking for trends. Those trends are what will help predict profits at the Big Mart stores.
Engineer the Features
At this stage, you need to start assigning variables to your data. You need to factor in what will affect your data. Does a heat wave during the summer cause coffee sales to drop? Does the holiday season affect sales of high-end coffee in all stores and not just middle-to-high-income neighborhoods? Things like seasonal purchases become variables you need to account for.
You may have to modify certain variables you created in order to have a better prediction of sales. For example, maybe sales of high-end coffee aren’t an indicator of profits, but whether the store sells a lot of holiday merchandise is. You’d have to examine and tweak the variables that make the most sense to solve your problem.
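A short pandas sketch of deriving seasonal features of that kind; the data and feature names are invented.

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2016-07-04", "2016-12-20", "2016-03-15"]),
    "units_sold": [12, 55, 20],
})

# Derive seasonal indicator variables from the raw timestamp.
df["month"] = df["date"].dt.month
df["is_summer"] = df["month"].isin([6, 7, 8])
df["is_holiday_season"] = df["month"].isin([11, 12])
```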
Create Your Predictive Models
At some point, you’ll have to come up with predictive models to support your hypotheses. For example, you’ll have to write code showing that when certain variables occur, sales fluctuate. For Big Mart, your predictive models might include holidays and other times of the year when retail sales spike. You may explore whether an after-Christmas sale increases profits, and if so, by how much. You may find that a certain percentage of sales earn more money than other sales, given the volume and overall profit.
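As a hedged sketch of this step, the following Python example fits a scikit-learn regressor to a tiny invented feature matrix; a real Big Mart model would use far more data and careful validation.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Invented feature matrix of the kind built in the previous steps.
df = pd.DataFrame({
    "is_holiday_season": [0, 1, 0, 1, 0, 1, 0, 1],
    "is_summer":         [1, 0, 0, 0, 1, 0, 1, 0],
    "avg_item_price":    [4.5, 6.0, 3.2, 5.8, 4.1, 6.3, 3.9, 5.5],
    "weekly_sales":      [210, 480, 190, 455, 220, 510, 205, 470],
})

X = df.drop(columns="weekly_sales")
y = df["weekly_sales"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestRegressor(random_state=0).fit(X_train, y_train)
print(model.score(X_test, y_test))  # R^2 on held-out data
```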
Communicate Your Results
In the real world, all the analysis and technical results that you come up with are of little value unless you can explain to your stakeholders what they mean in a way that’s comprehensible and compelling. Data storytelling is a critical and underrated skill that you must develop. To finish your project, you’ll want to create a data visualization or a presentation that explains your results to non-technical folks.
Bonus: How Many Projects Should Be in a Data Science Portfolio?
Data scientist and Springboard mentor David Yakobovitch recently shared expertise on how to optimize a data science portfolio with our data science student community. Among the advice he shared were these tips:
For the Data Science Career Track, we have two capstones that students work on, so I like to say a minimum of two projects in your portfolio. Often when I work with students and they’ve finished the capstones and they’re starting the job search, I say, “Why not start a third project?” That could be using data sets on popular sites such as Kaggle or using a passion project you’re interested in or partnering with a non-profit.
When you’re doing these interviews, you want to have multiple projects you can talk about. If you’re just talking about one project for a 30- to 60-minute interview, it doesn’t give you enough material. So that’s why it’s great to have two or three, because you could talk about the whole workflow—and ideally, these projects work on different components of data science.
Learning the theory behind data science is an important part of the process. But project-based learning is the key to fully understanding the data science process. Springboard emphasizes data science projects in all three data science courses. The Data Science Career Track features 14 real-world projects, including two industry-worthy capstone projects.
Interested in a project-based learning program that comes with the support of a mentor? Check out our Data Science Career Track—you’ll learn the skills and get the personalized guidance you need to land the job you want.