by Sohaib Hasan | 7 min read
Problem: The City of New York was charged with reaching out to guardians of all 4-year-olds after making pre-kindergarten available to all residents.
Objective: Maximize the number of individuals who received informative phone calls about local pre-kindergarten. Utilize outreach resources as efficiently as possible.
Method: Aggregate all potential sources of contact information of parents of 4-year-old children in the city, and apply fuzzy matching algorithms to create golden records for volunteer use.
A number of studies suggest that access to pre-kindergarten education provides children with a head start that lasts all through their educational career. However, since public schooling in the United States begins at kindergarten, it has been levied that families from lower socioeconomic strata are forced to accept that children from more affluent families will have an educational head start on their children due to their ability to pay for pre-kindergarten education. To counteract this dynamic, the Bill De Blasio administration in New York City made universal, free of cost, access to pre-kindergarten a key initiative for their first year in office. 
Accomplishing this required a number of different work streams. First, classroom space and programs had to be requisitioned (a massive project in and of itself). Next, parents and guardians across the city had to be made aware of the options in their neighborhood, which required the largest outreach effort in the history of New York City. The goal was to find every guardian of a 4-year-old citywide and have a volunteer call him or her with information detailing all the how/when/where’s of their local free pre-kindergarten option.
An exhaustive search revealed a number of potential sources for this information. Simply stacking all of this information on top of each other would likely have the unfortunate consequence of a number of duplicated entries. This would lead to some parents getting multiple calls from multiple volunteers, wasting the time of both parties. Further, with time as a serious and unforgiving constraint (schools must begin on a particular day of the year), there simply wasn’t enough bandwidth to call every household on every list, which would have the other unfortunate consequence of parents of eligible students never getting the information required to register their child.
The fundamental problem, then, was that data regarding parents of 4-year-olds had to be consolidated and presented to volunteers in a manner such that it could be acted upon with no time or resources wasted. This involved a three-pronged data engineering effort.
- Data had to be gathered and placed in a single location.
- Entries in each dataset had to be compared across datasets to consolidate records and create “golden” records and place them in a single location. Golden records are our best estimate of what represented an individual, with all of their contact information, .
- The golden records had to be loaded into a front-end that presented the information in a clear and actionable manner.
My former employer, the Mayor’s Office of Data Analytics (MODA) was tasked with the second of these three items: to solve the problem of creating golden records. In the world of data science/data engineering, this problem is known as ‘record linkage’ or ‘fuzzy matching’ . ‘Matching’, because we are trying to find and match the same entity across different sources. And ‘fuzzy’ because the matches made typically involve some level of uncertainty due to data integrity issues. These issues range from typos in data entry to differences in field structure. Ameliorating this uncertainty lies at the heart of solving fuzzy matching problems.
Interested in such Data for Good-stories? Follow @DataLook.
A total of 7 data sources were used to find records of 4-year-olds and guardians of 4-year-olds. (The data sources are described below this article.)
As a first step, all of the data was placed in a consolidated data warehouse known as Databridge , jointly owned by MODA and the Department of Information Technology and Telecommunications (DoITT). DoITT developed and maintained Extract-Transform-Load (ETL) jobs for all dynamic sets. Concurrently, a team from the Avanade consulting group built a Microsoft Dynamics Customer Relationship Management (CRM) tool called Brianna, ready to receive the results of the fuzzy matching job for use by volunteers. This tool would represent the front-end view of the data, allowing volunteers to work with the data in a more layman friendly interface, as well as to add additional information obtained during the phone calls.
The first step in the fuzzy matching workflow was to choose fields that all datasets had in common and consolidate their structure. The fields selected were last name, address, phone number, and Building Identification Number (BIN). Consolidating their structure involved tasks such as changing all characters to uppercase, removing special characters (e.g., hyphens from phone numbers), and renaming fields. Fields populated with obviously bad data, such as a phone number of “000-000-0000”, were nulled out to limit such instances matching with one another. Next, all of the datasets were fully joined so all of the information was in one place. For the first pass, the goal was to create golden records of households. Hence, matches were tagged if any of the following fields matched: Last Name & Phone Number; Last Name & Address; Address & Phone Number; Building Identification Number & Phone Number.
The fuzzy aspect of the above fields was dealt with using the SAS DQMATCH procedure, which generates tokenized versions of key fields. This method removes insignificant words (like “the” or “Mr.”), removes some vowels, and represents the key pieces of information in the submitted string as a token. More or fewer vowels are removed before tokenization according to the level of sensitivity requested. Generally, a token is a character that maps to a chunk of information within a string. Hence, common groups of strings are represented with a single token. An example of this would be the fact that many last names begin with the string ‘Mc’, e.g. McDonald, McCarren, McCoy. Hence, ‘Mc’ in a last name would be represented by an individual token. A particular character is used as a delimiter token between specific pieces of information, and certain uninformative strings are represented with wildcard tokens. Another token can represent the number of string manipulations that are allowed to create a cluster. The exact method of tokenization can vary, and its methodology can be the subject of an entirely separate discussion.
The tokenized versions are then compared to create clusters of information. A greater amount of sensitivity will lead to more edge cases erring in the direction of creating a new cluster (in our case, a new household), and vice versa. Thus, we now had clustered our data using the above method and cluster IDs were used to identify unique households.
Next, we had to identify individual children within a household. First, the entire dataset was sorted by household ID and the first name of the child. Child names were then standardized. Processing row by row, a flag called “new house” was created each time a new household was encountered. A field was created that represented the Levenshtein distance between the name of the child in the row and the name of the child in the preceding row. Levenshtein distance is defined as the number of string manipulations required to turn one string into another. For example, the distance between “ROBERT” and “JOBERY” is 2 because one must change the “J” to an “R” and the “Y” to a “T” to turn “JOBERY” into “ROBERT”. If the “new house” field for a row was on then a new field called “new child” was turned on. If the “new house” field is off for a row and the Levenshtein distance was greater than 2, then the “new child” field was also turned on. A new “child ID” field was created by concatenating the household ID and the numeric version of the first 6 letters of the child’s first name. If there was no child information present then the household ID was the same as the child ID and it was assumed that the household only contained one 4-year-old (a reasonable assumption, considering the relative infrequency of multiple 4-year-olds).
Now that household IDs and child IDs were created, a new dataset was created with one row per household, and all of the contact information about that household (including the number of children, calculated by counting the number of unique child IDs in the household). Thus, two datasets were output: the first was called the “contributing table” which contained every row from every input dataset, with household and child IDs added. The second was the table of golden records. The latter table was made available to the Brianna system for use by volunteers, at which point they could commence making their way down the list of records, contacting each family as they went.
Several of the datasets were updated on a regular basis. After the initial load, it was necessary to update household and their IDs as records were added or updated in the input datasets. However, we also needed to maintain the results of each run so that household IDs did not change with every execution of the workflow. This was solved by using the contributing set generated by the first run from the previous iteration as in input set from the next execution, and adding each input dataset’s row IDs and the previous household IDs to the match conditions. This rationalizes the creation of the contributing set. Hence, if household IDs matched on previous iterations, they were maintained in the next. The only exception was when a row was updated with new information that made it a better match for a different household, or necessitated the creation of a new household, in which case it was spun off into an existing or new household. Finally, households were sorted by their number of times they appeared in each dataset and the quality of the information compiled about the household. At the end of each iteration, the results of the workflow were inserted into tables designated for access by Brianna. Each morning, input datasets were refreshed and made available to MODA. After insuring the updates had arrived correctly, the script that executed all of the above steps was run.
The result was that, on a daily basis, the back-end of the Brianna system was refreshed with the best contact information of households and the children within them that could potentially be enrolled in a Pre-K program. The information was ordered by the likelihood that calling the household would result in a successful connection and conversation. Hence, volunteer effort was optimized such that the highest value phone calls could be made with the greatest haste. The tables also became useful for compiling a number of reports that calculated, among other things, the number of enrollments on an ongoing basis. By the time the effort was done and dusted, a total of 53,250 students were enrolled in a Pre-K program in their neighborhood, a mammoth achievement for the city and the new administration.
The pre-kindergarten outreach effort was one of the largest outreach initiatives ever launched by a municipal government. It required the collaboration of a large number of city agencies and had many concurrent work-streams, from space provisioning, to permitting, to outreach. The data story was just one of many narratives in this epic, each worthy of its own exposition. However, the MODA story highlights the sometimes surprising applications of data science. Data science was an immensely useful way to efficiently leverage scarce resources and likely led to the enrollment of many students who would not have otherwise known to take advantage of the programs made available to them. When we envision governments executing on projects, we don’t typically imagine programmers and data scientists amongst their ranks. This project shows that any major technical initiative undertaken by an organization would be well-advised to consider employing some data scientists and programmers to see what their skillset can contribute to the problem.
Project duration: Four months
Tools: Base SAS, SAS Dataflux
Sohaib is the former Chief Analyst of NYC Mayor’s Office of Data Analytics. He is now a Data Scientist at OnDeck.
- HHS-Connect: This data source contains entities from the Health and Human Services (HHS) common client index (CCI). The index is composed of the output from an entity resolution process that occurs between three health and human services city agency source systems. All three systems record entities that are receiving or have received benefits from the city in some capacity. This data was a one-time load and remained static. It contains information about each head of household that contains an individual receiving benefits, but also records information about the entity in question, whether or not they are the head of the household.
- Experian: This data was purchased from Experian and reflects all individuals in their systems that have a 4-year-old child in their home and reside in New York City. This was also a one-time load. It is at the head of household level and records no information about children.
- Department of Health and Mental Hygiene (DOHMH) Birth Records: This data source was delivered by DOHMH and reflects every birth recorded in New York city 4 years ago. This was a one-time load and will not be refreshed. This data set is at the child level, but also records information about the child’s guardian.
- SEMS-Matched & SEMS-UnMatched: Two separate tables from the Department of Education’s Student Enrollment Management System (SEMS). They contain the same information about two different populations. The entities recorded are those that pre-registered for seats at schools and are divided by those who matched with the school of their choice and those that did not. SEMS was only active prior to the current active enrollment phase so these datasets are also static. This dataset records information at the child level, but also has guardian information.
- CBECC Online: This data source captures every application to schools designated for pre-k called CBECCs (community based early childhood center). Parents may submit as many applications for as many children as they wish. This dataset is at the application level and records both child and guardian information. It was refreshed on a nightly basis.
- ATS (Automate The School): This source captures every accepted and verified registration for pre-k. There can only be one entry per child and all data is verified with no blanks or corrupted information permitted. It records both child and guardian information and was refreshed nightly.
- Pre-Kids: This source contains information entered for every registration at every CBECC. However, this information is not verified or cleaned. It is at the registration level but the data quality can vary significantly since data entry is done by the CBECC and is not validated. This source was refreshed nightly and records are updated with new or changed information frequently.