Data Broker Services

The growth in the use of electronic medical records, electronic insurance claims processing and other healthcare information systems has led to massive increases in the collection and storage of individually identifiable health information. Generally speaking, use of identifiable health information is restricted to the primary purpose of direct provision of healthcare related services, including associated administrative activities such as billing and insurance claims. Health information remains one of the most sensitive types of personally identifiable information (PII). Individuals want to be assured that their healthcare information, including their diagnoses, lab results and medications, remains accessible only to those with an absolute need to know. The Health Insurance Portability and Accountability Act (HIPAA), enacted by the U.S. Congress in 1996, established national standards for the privacy and security of protected health information (PHI).

On the other hand, the adoption of health information technologies increases the potential for beneficial studies that use these increasingly large, complex data sets. Any use or disclosure of an individual’s PHI outside of the direct provision of healthcare related services, such as for research, usually requires explicit authorization or consent from that individual. Obtaining such authorization or consent can be operationally difficult, especially after the patient is no longer present. Individuals are also concerned with the increasing frequency of data breaches and the resulting potential for identity theft, insurance fraud and exposure of their sensitive health information. Organizations are concerned with incident costs, including regulatory fines, lawsuits and reputational damage.

The process of de-identification, by which identifying information is removed from a data set so that the data cannot be linked to a specific person, mitigates privacy risks to individuals. Using de-identified data whenever possible reduces risk to the organization by decreasing the dissemination of regulated data such as PHI, thereby reducing the potential for data breaches. There is also no requirement to obtain authorization or consent for use of de-identified data. De-identification thereby supports the secondary use of data for life sciences research as well as quality assurance, comparative effectiveness studies, precision medicine initiatives and other endeavors. De-identification attempts to balance the contradictory goals of using and sharing personal information while protecting privacy. Please note that use of de-identified data for human subjects research will still generally need Institutional Review Board (IRB) review and approval.

The HIPAA Privacy Rule provides two methods for de-identification of health information: expert determination and safe harbor. De-identification leads to information loss, which may limit the usefulness of the resulting health information in certain circumstances. This is particularly true for the safe harbor method, which uses a strict, inflexible approach. The expert determination method takes a risk-based approach that applies current standards and best practices from de-identification research.

Privacy Office De-Identification Services

The Privacy Office, in conjunction with UHealth IT, provides services that can de-identify clinical data sets using software solutions from Privacy Analytics, a market leader in the expert de-identification method. As with the safe harbor method, all direct identifiers such as name, social security number, medical record number and email address must still be removed or masked. However, depending on the data set, the expert method may allow greater flexibility in the handling of indirect identifiers such as zip code and dates of events. This results in potentially richer data sets that better preserve the analytical qualities of the data while still seeking an appropriately low probability of re-identification.
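The difference between direct and indirect identifiers can be illustrated with a minimal Python sketch. This is not the Privacy Analytics software. The field names, the sample record and the choice to keep three zip code digits are all hypothetical assumptions, and under the expert method the appropriate level of coarsening would come from the risk assessment rather than from a fixed rule.

```python
# Hypothetical field names; a real extract will have its own schema.
DIRECT_IDENTIFIERS = {"name", "ssn", "mrn", "email"}


def strip_direct_identifiers(record: dict) -> dict:
    """Return a copy of the record without any direct-identifier fields."""
    return {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}


def coarsen_zip(zip_code: str, digits: int = 3) -> str:
    """Keep only the leading digits of a zip code and zero out the rest.

    The number of digits kept is an illustrative assumption, not a rule.
    """
    return zip_code[:digits] + "0" * (len(zip_code) - digits)


# Made-up record used only for illustration.
record = {
    "name": "Jane Doe",
    "ssn": "000-00-0000",
    "mrn": "A12345",
    "email": "jane@example.com",
    "zip": "33136",
    "diagnosis": "type 2 diabetes",
}

cleaned = strip_direct_identifiers(record)
cleaned["zip"] = coarsen_zip(cleaned["zip"])
print(cleaned)
```

Direct identifiers are dropped outright, while the indirect identifier (zip code) is kept at reduced precision so that it remains usable for analysis.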

Privacy Analytics provides one tool for de-identification of structured data. Please note that the software requires data sets with unique record identifiers. UHealth IT will perform the initial extract of the data and can provide guidance on matters such as which data fields are available and the criteria for data selection. This data set will then be passed to the data broker for de-identification.

The de-identification process is iterative: multiple passes through the data set may be required to achieve a sufficiently low risk level. Take a date of birth (DOB) field as an example. Generally speaking, de-identified data sets will not include the full date (MM/DD/YYYY). An initial pass may remove the DD component (in practice, replacing every actual DD value with 01), leaving the data set with an accurate month and year of birth (MM/YYYY). If this yields an acceptable risk level, the de-identified data set will include that level of precision, i.e. the accurate month and year of the original date. However, if the risk of re-identification is still above the threshold value, another pass that removes the month component may be necessary. In that case, only the year (YYYY) component of the DOB field can be retained in order to reach an appropriately low risk of re-identification. There is also an option to shift dates so that the full date format is retained: all the dates for a particular patient can be shifted in a manner that attempts to maintain the relationship between different dates.
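The multi-pass generalization and the date-shifting option described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not how the Privacy Analytics software measures risk. The risk measure (one divided by the size of the smallest group of records sharing a value), the threshold, the function names and the sample dates are all hypothetical.

```python
import random
from collections import Counter
from datetime import date, timedelta


def generalize(d: date, level: int) -> str:
    """Level 0 keeps the full date, level 1 sets the day to 01, level 2 keeps the year."""
    if level == 0:
        return d.strftime("%m/%d/%Y")
    if level == 1:
        return d.strftime("%m/01/%Y")
    return d.strftime("%Y")


def max_risk(values):
    """Hypothetical risk measure: 1 / (size of the smallest group sharing a value)."""
    return 1 / min(Counter(values).values())


def generalize_until_acceptable(dobs, threshold):
    """Try each level in turn and stop at the first one whose risk is acceptable."""
    for level in range(3):
        values = [generalize(d, level) for d in dobs]
        if max_risk(values) <= threshold:
            return level, values
    return None, []  # even year-only precision exceeds the threshold


def shift_dates(dates, max_days, rng):
    """Shift all of one patient's dates by the same random offset,
    so the intervals between them are preserved."""
    offset = timedelta(days=rng.randint(-max_days, max_days))
    return [d + offset for d in dates]


# Made-up dates of birth used only for illustration.
dobs = [date(1980, 3, 5), date(1980, 3, 19), date(1980, 7, 2),
        date(1980, 7, 30), date(1975, 3, 5), date(1975, 11, 1)]
level, values = generalize_until_acceptable(dobs, threshold=0.5)
print(level, values)

# Made-up visit dates for a single patient.
visits = [date(2020, 1, 10), date(2020, 1, 24)]
shifted = shift_dates(visits, max_days=30, rng=random.Random())
print(shifted[1] - shifted[0])
```

With these six sample dates and a threshold of 0.5, both the full dates and the month-and-year values still contain records that are unique, so the loop only stops at year-only precision. The two shifted visit dates remain 14 days apart because a single offset is applied to both.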

Hence, this process may require frequent interaction and communication between the requestor and the data broker to reach a solution that both acceptably de-identifies the data and retains its analytical usefulness to the recipient.

Uniquely, Privacy Analytics also offers a second tool for de-identification of unstructured data such as progress notes. Note that the unstructured data set (for example, one single field containing free-form text) must be separate and distinct from any structured data set. The two tools are run separately against their respective targets: one is used only for structured data and the other only for unstructured data. The unstructured data tool uses natural language processing techniques to detect personally identifiable information (PII) in unstructured text, such as names, locations, IDs, CPT codes, ages, dates, phone numbers and email addresses. Once this information is detected, there are multiple options for handling it, including masking (replacing the characters with *****) and replacing (a similar value is chosen at random to realistically mask the personal information, e.g. Bob Smith is replaced with John Jefferies).
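The two handling options, masking and replacing, can be contrasted with a short Python sketch. It is a simplified stand-in and not the tool's actual behavior: detection here uses two plain regular expressions instead of natural language processing, and the patterns, surrogate values, function names and sample note are hypothetical.

```python
import random
import re

# Stand-in detectors (assumption): the tool described above uses NLP;
# these simple regexes only illustrate what happens after detection.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "phone": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}

# Hypothetical pools of realistic-looking replacement values.
SURROGATES = {
    "email": ["user1@example.com", "user2@example.com"],
    "phone": ["555-010-0001", "555-010-0002"],
}


def mask(text: str) -> str:
    """Replace every detected span with asterisks of the same length."""
    for pattern in PATTERNS.values():
        text = pattern.sub(lambda m: "*" * len(m.group()), text)
    return text


def substitute(text: str, rng: random.Random) -> str:
    """Replace every detected span with a randomly chosen value of the same type."""
    for kind, pattern in PATTERNS.items():
        text = pattern.sub(lambda m, k=kind: rng.choice(SURROGATES[k]), text)
    return text


# Made-up note used only for illustration.
note = "Pt called from 305-555-0142; follow up at pat.lee@example.org."

masked = mask(note)
replaced = substitute(note, random.Random())
print(masked)
print(replaced)
```

Masking makes it obvious that something was removed, whereas replacement keeps the text readable and realistic. This sketch masks with asterisks of the same length as the original value, which is one possible design choice; a fixed-length mask would avoid revealing the length of the original.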