Data Broker Services
The growth in use of electronic medical records, electronic insurance claims processing and other healthcare information systems has led to massive increases in the collection and storage of individually identifiable health information. Generally speaking, use of identifiable health information is restricted to the primary purpose of direct provision of healthcare related services, including the associated administrative activities such as billing and insurance claims. Health information remains one of the most sensitive types of personally identifiable information (PII). Individuals want to be assured that their healthcare information, including their diagnoses, lab results and medications, remains accessible only to those with an absolute need to know. The Health Insurance Portability and Accountability Act (HIPAA), enacted by the U.S. Congress in 1996, established national standards for the privacy and security of protected health information (PHI).
On the other hand, the adoption of health information technologies increases the potential for beneficial studies that utilize these increasingly large, complex data sets. Any use or disclosure of an individual’s PHI outside of the direct provision of healthcare related services, for example for research, usually requires explicit authorization/consent from that individual. Obtaining such authorizations/consents can be operationally difficult, especially after the patient is no longer present. Individuals are also concerned with the increasing frequency of data breaches and the resulting potential for identity theft, insurance fraud and exposure of their sensitive health information. Organizations are concerned with incident costs, including regulatory fines, lawsuits and reputational damage.
The process of de-identification, by which identifying information is removed from a data set so that data cannot be linked to a specific person, mitigates privacy risks to individuals. Using de-identified data, whenever possible, reduces risk to the organization by decreasing the dissemination of regulated data such as PHI, thus reducing the potential for data breaches. There is also no requirement to obtain authorizations/consents for use of de-identified data. De-identification thereby supports the secondary use of data for life sciences research as well as quality assurance, comparative effectiveness studies, precision medicine initiatives and other endeavors. De-identification attempts to balance the competing goals of using and sharing personal information while protecting privacy. Please note that use of de-identified data for human subject research will still generally need IRB review and approval.
The HIPAA Privacy Rule provides two methods for de-identification of health information: expert determination and safe harbor. De-identification leads to information loss, which may limit the usefulness of the resulting health information in certain circumstances. This is particularly true for the safe harbor method, which utilizes a strict, inflexible approach. The expert method takes a risk-based approach that applies current standards and best practices from de-identification research.
OHPS De-Identification Services
OHPS, in conjunction with UHealth IT, provides services that can de-identify clinical data sets utilizing software solutions from Privacy Analytics, a market leader in the expert de-identification method. As with the safe harbor method, all direct identifiers such as name, social security number, medical record number and email address must still be removed or masked. However, depending on the data sets, the expert method may allow greater flexibility in the handling of indirect identifiers such as zip code and dates of event. This results in potentially richer data sets that better preserve the analytical qualities of the data while simultaneously seeking an appropriately low probability of re-identification.
Privacy Analytics provides one tool for use in de-identification of structured data. Please note that the software requires data sets with unique record identifiers. UHealth IT will perform the initial extract of the data and can provide guidance on what data fields are available, criteria for data selection etc. This data set will then be passed to the data broker for de-identification.
The de-identification process is iterative: multiple passes through the data set may be required to achieve a sufficiently low risk level. Take a date of birth (DOB) field as an example. Generally speaking, de-identified data sets will not include the full date (MM/DD/YYYY). An initial pass may remove the DD component (in practice, replacing every actual DD value with 01), leaving the data set with an accurate month and year of birth (MM/YYYY). If this yields an acceptable risk level, the de-identified data set will retain that level of precision, i.e. the accurate month and year of the original date. However, if the risk of re-identification is still above a threshold value, another pass may be necessary to remove the month component as well, so that only the year (YYYY) of the DOB is retained. There is also an option to shift dates so that the full date format is retained. All the dates for a particular patient can be shifted in a manner that attempts to maintain the relationship between different dates.
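As an illustration of the date-shifting option, the following is a minimal sketch and not the Privacy Analytics implementation. It assumes each record carries a hypothetical patient_id and event_date field, and it derives one fixed offset per patient from a secret key, so that every date for the same patient moves by the same number of days and the intervals between that patient's events are preserved. The field names, the 180-day window and the keyed-hash approach are all assumptions made for this example.

```python
import hashlib
import hmac
from datetime import date, timedelta


def patient_offset(patient_id, secret, max_days=180):
    """Derive a stable per-patient offset in [-max_days, +max_days] days.

    Assumption: a keyed hash (HMAC-SHA256) of the patient ID is used so the
    same patient always gets the same offset without storing a lookup table.
    """
    digest = hmac.new(secret.encode(), patient_id.encode(), hashlib.sha256).digest()
    span = 2 * max_days + 1
    return timedelta(days=int.from_bytes(digest[:8], "big") % span - max_days)


def shift_dates(records, secret, max_days=180):
    """Return copies of the records with event_date shifted per patient."""
    shifted = []
    for rec in records:
        offset = patient_offset(rec["patient_id"], secret, max_days)
        shifted.append({**rec, "event_date": rec["event_date"] + offset})
    return shifted


# Hypothetical records: an admission and a discharge for the same patient.
records = [
    {"patient_id": "p1", "event_date": date(2020, 1, 10)},
    {"patient_id": "p1", "event_date": date(2020, 1, 25)},
]
shifted = shift_dates(records, secret="demo-secret")
# The 15-day gap between the two events is unchanged after shifting.
print(shifted[1]["event_date"] - shifted[0]["event_date"])
```

In this sketch the offset can occasionally be zero, and the patient_id itself would still need to be removed or replaced before release; both points would have to be handled in any real use.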
Hence this process may require frequent interaction and communication between the requestor and the data broker to arrive at a solution that both acceptably de-identifies the data and retains its analytical usefulness to the recipient.
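The iterative passes described for the DOB field can be sketched as a loop that tries each level of precision in turn and stops at the first one whose risk falls at or below a threshold. This is only an illustration under stated assumptions: the risk measure used here (one divided by the size of the smallest group of records sharing the same value) is a simple stand-in chosen for the example, and an actual expert determination weighs many fields and the context of the release. The function names and the threshold values are hypothetical.

```python
from collections import Counter
from datetime import date

# Generalization levels for a date of birth, from most to least precise.
LEVELS = [
    ("month_year", lambda d: d.replace(day=1).isoformat()),  # DD replaced with 01
    ("year", lambda d: str(d.year)),                         # month removed as well
]


def max_risk(values):
    """Assumed risk measure: 1 / size of the smallest group sharing a value."""
    counts = Counter(values)
    return 1 / min(counts.values())


def generalize_dob(dobs, threshold):
    """Return (level name, generalized values) for the first acceptable level.

    Returns (None, None) if even the coarsest level is above the threshold,
    in which case the field would need further treatment or removal.
    """
    for name, generalize in LEVELS:
        generalized = [generalize(d) for d in dobs]
        if max_risk(generalized) <= threshold:
            return name, generalized
    return None, None


# Hypothetical data: no two patients share a birth month, so only the
# year-level pass can bring the assumed risk down to the 0.5 threshold.
dobs = [date(1980, 3, 5), date(1980, 4, 20), date(1980, 7, 1), date(1980, 8, 9)]
level, values = generalize_dob(dobs, threshold=0.5)
print(level, values)
```

The ordering of LEVELS encodes the preference for retaining as much precision as the threshold allows, which is the trade-off the requestor and the data broker would discuss.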
Uniquely, Privacy Analytics also offers another tool for de-identification of unstructured data such as progress notes. Note that the unstructured data set (one single field containing free form text, for example) must be a separate and distinct data set from any structured data. These are two distinct tools that are run separately against their respective targets: one tool is used only for structured data and the other only for unstructured data. The unstructured data tool uses natural language processing techniques to detect personally identifiable information (PII) in unstructured text, such as names, locations, IDs, CPT codes, age, dates, phone numbers and email addresses. Once this information is detected, there are multiple options for handling it, including masking (replacing the characters with *****) or replacing (a similar value is chosen at random to realistically mask the personal information, e.g. Bob Smith is replaced with John Jefferies).
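The difference between the masking and replacing options can be illustrated with a small sketch. Note that it substitutes simple regular expressions for the natural language processing detection described above, so it only recognizes email addresses and one common phone number format; it is not how the Privacy Analytics tool works. The patterns, the surrogate values and the function names are assumptions made for this example.

```python
import random
import re

# Assumption: regular expressions stand in for NLP-based detection.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "phone": re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b"),
}

# Hypothetical surrogate values used by the "replacing" option.
SURROGATES = {
    "email": ["patient.a@example.com", "patient.b@example.com"],
    "phone": ["555-555-0101", "555-555-0199"],
}


def mask(text):
    """Masking: every detected value is replaced with *****."""
    for pattern in PATTERNS.values():
        text = pattern.sub("*****", text)
    return text


def replace(text, rng):
    """Replacing: every detected value is swapped for a random surrogate."""
    for kind, pattern in PATTERNS.items():
        text = pattern.sub(lambda m, k=kind: rng.choice(SURROGATES[k]), text)
    return text


note = "Call 305-555-0100 or email jane.doe@example.org."
print(mask(note))
print(replace(note, random.Random(0)))
```

Masking makes it obvious that something was removed, while replacing keeps the text readable for downstream analysis; which is preferable depends on how the recipient intends to use the notes.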
A new data request process, which includes the option for expert de-identified data, is currently being finalized. This will utilize the ServiceNow platform. In the meantime, to learn more about these services and to make requests for de-identified data sets, please contact firstname.lastname@example.org. The initial data extract will be performed by UHealth IT and then routed to the data broker for de-identification. You may be contacted by the UHealth IT team and/or the data broker for clarification of your request.
HIPAA Privacy Rule & De-Identification in Depth
Section 164.514(a) of the HIPAA Privacy Rule provides the standard for de-identification of protected health information. Under this standard, health information is not individually identifiable if it does not identify an individual and the covered entity has no reasonable basis to believe it can be used to identify an individual.
Sections 164.514(b) and (c) of the Privacy Rule contain the implementation specifications that a covered entity must follow to meet the de-identification standard. As shown in Figure 1 from the HHS guidance, the Privacy Rule provides two methods by which health information can be designated as de-identified.
Figure 1. Two methods to achieve de-identification in accordance with the HIPAA Privacy Rule.
The first is the “Expert Determination” method:
Implementation specifications: requirements for de-identification of protected health information. A covered entity may determine that health information is not individually identifiable health information only if:
1. A person with appropriate knowledge of and experience with generally accepted statistical and scientific principles and methods for rendering information not individually identifiable:
- Applying such principles and methods, determines that the risk is very small that the information could be used, alone or in combination with other reasonably available information, by an anticipated recipient to identify an individual who is a subject of the information; and
- Documents the methods and results of the analysis that justify such determination; or
The second is the “Safe Harbor” method:
The following identifiers of the individual or of relatives, employers, or household members of the individual, are removed:
1. Names
2. All geographic subdivisions smaller than a state, including street address, city, county, precinct, ZIP code, and their equivalent geocodes, except for the initial three digits of the ZIP code if, according to the current publicly available data from the Bureau of the Census:
- The geographic unit formed by combining all ZIP codes with the same three initial digits contains more than 20,000 people; and
- The initial three digits of a ZIP code for all such geographic units containing 20,000 or fewer people is changed to 000
3. All elements of dates (except year) for dates that are directly related to an individual, including birth date, admission date, discharge date, death date, and all ages over 89 and all elements of dates (including year) indicative of such age, except that such ages and elements may be aggregated into a single category of age 90 or older
4. Telephone numbers
5. Fax numbers
6. Email addresses
7. Social security numbers
8. Medical record numbers
9. Health plan beneficiary numbers
10. Account numbers
11. Certificate/license numbers
12. Vehicle identifiers and serial numbers, including license plate numbers
13. Device identifiers and serial numbers
14. Web Universal Resource Locators (URLs)
15. Internet Protocol (IP) addresses
16. Biometric identifiers, including finger and voice prints
17. Full-face photographs and any comparable images
18. Any other unique identifying number, characteristic, or code, except as permitted by §164.514(c); and
The covered entity does not have actual knowledge that the information could be used alone or in combination with other information to identify an individual who is a subject of the information.
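Several of the Safe Harbor rules listed above are mechanical enough to sketch in code. The example below covers only the ZIP code, date and age rules and is not a complete Safe Harbor implementation. The function names are hypothetical, and the population table contains made-up figures; in practice the three-digit ZIP populations would come from current Bureau of the Census data.

```python
from datetime import date


def safe_harbor_zip(zip_code, zip3_population):
    """Keep the first three ZIP digits only if that area has > 20,000 people.

    zip3_population maps a three-digit prefix to its population; prefixes
    that are missing from the table are treated as too small and become 000.
    """
    prefix = zip_code[:3]
    if zip3_population.get(prefix, 0) > 20000:
        return prefix
    return "000"


def safe_harbor_date(d):
    """Keep only the year of a date directly related to an individual."""
    return d.year


def safe_harbor_age(age):
    """Aggregate all ages over 89 into a single '90 or older' category."""
    return "90 or older" if age > 89 else str(age)


# Hypothetical population figures, for illustration only.
population = {"331": 1500000, "036": 12000}

print(safe_harbor_zip("33136", population))
print(safe_harbor_zip("03601", population))
print(safe_harbor_date(date(2019, 6, 14)))
print(safe_harbor_age(93))
```

The sketch does not handle the related requirement that date elements indicative of an age over 89, including the year, are also removed or aggregated; a fuller treatment would check the patient's age before releasing any year value.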
Satisfying either method would demonstrate that a covered entity has met the standard in §164.514(a) above. De-identified health information created following these methods is no longer protected by the Privacy Rule because it does not fall within the definition of PHI.
Limited Data Set
A related term is a limited data set (LDS). A limited data set of information may be disclosed to an outside party without a patient’s authorization if certain conditions are met. First, the purpose of the disclosure may only be for research, public health or health care operations. Second, the person receiving the information must generally sign a data use agreement. Specifically, it is distinguished from a Safe Harbor de-identified data set in that it allows retention of the following identifiers:
- dates such as admission, discharge, service, DOB, DOD;
- city, state, and five-digit or longer ZIP code; and
- ages in years, months, days or hours.
All other identifiers, such as name, telephone numbers, email addresses, social security numbers and medical record numbers, must be removed. It is important to note that this information is still protected health information or “PHI” under HIPAA. It is not de-identified information and is still subject to the requirements of the Privacy Regulations.