Data Broker Services : De-Identification Services

Removing identifying information from patient data so that the data cannot be linked to a specific person mitigates privacy risks to individuals and reduces risk to the organization by limiting the exposure from potential data breaches. There is also no requirement to obtain authorizations or consents for the use of de-identified data. De-identification thereby supports the secondary use of data for life sciences research as well as quality assurance, comparative effectiveness studies, precision medicine initiatives and other endeavors. De-identification attempts to balance the competing goals of using and sharing personal information while protecting privacy. Please note that use of de-identified data for human subjects research will still generally need IRB review.

The HIPAA Privacy Rule and associated guidance provide two methods for de-identification of health information:

  • Expert Determination Method: A person with appropriate knowledge of and experience with generally accepted statistical and scientific principles and methods for rendering information not individually identifiable:
    • Applying such principles and methods, determines that the risk is very small that the information could be used, alone or in combination with other reasonably available information, by an anticipated recipient to identify an individual who is a subject of the information; and
    • Documents the methods and results of the analysis that justify such determination.
  • Safe Harbor Method: (i) The 18 identifiers defined in the HIPAA Privacy Rule (45 CFR 164.514(b)(2)), whether of the individual or of relatives, employers, or household members of the individual, are removed; and (ii) the covered entity does not have actual knowledge that the information could be used alone or in combination with other information to identify an individual who is a subject of the information.

De-identification leads to information loss, which may limit the usefulness of the resulting health information in certain circumstances. This is particularly true for the safe harbor method, which uses a strict, inflexible approach. The expert method instead takes a risk-based approach that applies current standards and best practices from de-identification research.

The Office of Privacy & Data Security, in conjunction with UHealth IT, provides services that can de-identify clinical data sets using software solutions from Privacy Analytics, a market leader in the expert de-identification method. As with the safe harbor method, all direct identifiers such as name, social security number, medical record number and email address must still be removed or masked. However, depending on the data set, the expert method may allow greater flexibility in the handling of indirect identifiers such as zip codes and event dates. This can result in richer data sets that better preserve the analytical qualities of the data while still seeking an appropriately low probability of re-identification.
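As a rough illustration of the first step, removing direct identifiers from a structured record might look like the sketch below. The field names and the `DIRECT_IDENTIFIERS` set are hypothetical; they are not taken from any UHealth extract or from the Privacy Analytics software, and a real data set would need its own field inventory.

```python
# Hypothetical field names for illustration only; a real extract
# would define its own list of direct identifier fields.
DIRECT_IDENTIFIERS = {"name", "ssn", "mrn", "email"}


def strip_direct_identifiers(record):
    """Return a copy of the record without direct identifier fields."""
    return {
        field: value
        for field, value in record.items()
        if field not in DIRECT_IDENTIFIERS
    }


record = {
    "record_id": 1001,  # unique record identifier, kept
    "name": "Bob Smith",
    "ssn": "000-00-0000",
    "mrn": "A12345",
    "email": "bob.smith@example.org",
    "zip": "33136",  # indirect identifier, handled separately
    "admit_date": "2021-04-02",  # indirect identifier, handled separately
}

print(strip_direct_identifiers(record))
```

The indirect identifiers (zip code and date in this sketch) are deliberately left in place, since how far they need to be generalized depends on the measured re-identification risk.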

Privacy Analytics provides one tool for de-identification of structured data. Please note that the software requires data sets with unique record identifiers. UHealth IT will perform the initial extract of the data and can provide guidance on which data fields are available, criteria for data selection and so on. This data set will then be passed to the data broker for de-identification.

De-identification is an iterative process: multiple passes through the data set may be required to reach a sufficiently low risk level. Take a date of birth (DOB) field as an example. Generally speaking, de-identified data sets will not include the full date (MM/DD/YYYY). An initial pass may remove the DD component, which in practice means replacing every actual DD value with 01, so that the data set retains the accurate month and year of the DOB (MM/YYYY). If this yields an acceptable risk level, the de-identified data set will keep that level of precision, i.e. the accurate month and year of the original date. If the risk of re-identification is still above the threshold value, another pass may be needed to remove the month component, so that only the year (YYYY) of the DOB is retained.

There is also an option to shift dates so that the full date format is retained. All the dates for a particular patient can be shifted in a manner that attempts to maintain the relationship between different dates.
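The two ideas above, stepwise generalization of the DOB and per-patient date shifting, can be sketched as follows. This is a minimal illustration and not the Privacy Analytics algorithm: the risk measure (one divided by the size of the smallest group of records sharing a value), the threshold, the constant per-patient offset and all function names are assumptions made for the example.

```python
import random
from collections import Counter
from datetime import date, timedelta


def generalize_dob(dob, level):
    """Level 0: full date; level 1: day replaced with 01; level 2: year only."""
    if level == 0:
        return dob.isoformat()
    if level == 1:
        return dob.replace(day=1).isoformat()
    return str(dob.year)


def max_risk(values):
    """Assumed stand-in risk metric: 1 / size of the smallest group of equal values."""
    counts = Counter(values)
    return 1.0 / min(counts.values())


def generalize_until_acceptable(dobs, threshold):
    """Try successively coarser levels until the risk is at or below the threshold."""
    for level in (0, 1, 2):
        values = [generalize_dob(d, level) for d in dobs]
        if max_risk(values) <= threshold:
            return level, values
    return None, None  # even year-only is too risky under this metric


def shift_patient_dates(dates_by_patient, max_days=30, seed=None):
    """Shift all dates of a patient by one random offset, preserving intervals."""
    rng = random.Random(seed)
    shifted = {}
    for patient, dates in dates_by_patient.items():
        offset = timedelta(days=rng.randint(-max_days, max_days))
        shifted[patient] = [d + offset for d in dates]
    return shifted


dobs = [date(1980, 3, 14), date(1980, 3, 2), date(1980, 7, 9), date(1980, 7, 30)]
level, values = generalize_until_acceptable(dobs, threshold=0.5)
print(level, values)

visits = {"p1": [date(2021, 1, 1), date(2021, 1, 15)]}
print(shift_patient_dates(visits, seed=7))
```

Using a single offset per patient is one simple way to keep the gaps between a patient's dates intact; the actual tool may use a different scheme.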

This process may therefore require frequent interaction and communication between the requestor and the data broker to arrive at a solution that both adequately de-identifies the data and retains its analytical usefulness to the recipient.

Uniquely, Privacy Analytics also offers a second tool for de-identification of unstructured data such as progress notes. The unstructured data set (for example, a single field containing free-form text) must be separate and distinct from any structured data set. The two tools are run separately against their respective targets: one is used only for structured data and the other only for unstructured data. The unstructured-data tool uses natural language processing techniques to detect personally identifiable information (PII) in free text, such as names, locations, IDs, CPT codes, ages, dates, phone numbers and email addresses. Once this information is detected, there are multiple options for handling it, including masking (replacing the characters with *****) or replacing (substituting a similar value chosen at random so that the text remains realistic, e.g. Bob Smith is replaced with John Jefferies).
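The difference between masking and replacing can be sketched as below. This is only an illustration of the two handling options, not of the detection step: simple regular expressions and a caller-supplied list of names stand in for the natural language processing that the actual tool performs, and the patterns, the `SURROGATE_NAMES` list and the function names are all assumptions.

```python
import random
import re

# Deliberately simple patterns; a real detector would be far more thorough.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b")

# Hypothetical pool of realistic replacement names.
SURROGATE_NAMES = ["John Jefferies", "Alex Morgan", "Pat Rivera"]


def mask(text):
    """Masking: overwrite each detected span with asterisks of the same length."""
    for pattern in (EMAIL, PHONE):
        text = pattern.sub(lambda m: "*" * len(m.group()), text)
    return text


def replace_names(text, detected_names, rng):
    """Replacing: swap each detected name for a randomly chosen surrogate."""
    for name in detected_names:
        text = text.replace(name, rng.choice(SURROGATE_NAMES))
    return text


note = "Bob Smith called from 305-555-0100; email bob.smith@example.org."

print(mask(note))
# detected_names would come from the detection step; it is hard-coded here.
print(replace_names(note, ["Bob Smith"], random.Random(0)))
```

Masking makes it obvious that something was removed, while replacing keeps the note readable as natural text, which may matter for downstream text analysis.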