As we near the end of WGBH’s involvement in the management of the American Archive Content Inventory Project, we wanted to provide some details on what work has been done to improve participant data. All records submitted by stations who received funding from CPB for their participation in the inventory project went through a process we call normalization.
How The Process Works
Normalization was done by a WGBH staff member who individually reviewed each station’s records to help ensure that, as far as possible, all submitted data conformed to the vocabulary prescribed in PBCore 1.3.
Normalization was performed using an analyze function built by our web developer. Once the analyze function has been run, the WGBH staff member sees the user-generated data broken up by PBCore field heading, with each heading accompanied by a list of the terms the station used in that field.
In the example provided above for WGBH, the following terms were used in the Media Type field: Animation, Artifact, Moving Image, Sound & asfgd. These terms are therefore listed below the field heading labeled: //pbcore:pbcoreInstantiation/pbcore:formatMediaType.
Next to each term is a number indicating how many of WGBH’s records contain that term in the given field. Moving Image, for example, appears in 2,284 records in the Media Type field.
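The per-field tally described here can be approximated in a few lines of Python. This is a minimal sketch, not the analyze function's actual code: the record layout (a dict mapping a PBCore field path to the list of values entered in that field), the function name `analyze`, and the sample records are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def analyze(records):
    """Count, per field, how many records contain each term.

    `records` is assumed to be a list of dicts mapping a PBCore field
    path to the list of values a station entered in that field.
    """
    tallies = defaultdict(Counter)
    for record in records:
        for field, values in record.items():
            # set() so a term repeated within one record is counted once
            for term in set(values):
                tallies[field][term] += 1
    return tallies


MEDIA_TYPE = "//pbcore:pbcoreInstantiation/pbcore:formatMediaType"

# Hypothetical sample records, not real station data
sample = [
    {MEDIA_TYPE: ["Moving Image"]},
    {MEDIA_TYPE: ["Moving Image"]},
    {MEDIA_TYPE: ["Sound"]},
    {MEDIA_TYPE: ["asfgd"]},
]

report = analyze(sample)
for term, count in report[MEDIA_TYPE].most_common():
    print(f"{term}: {count}")
```

Listing the terms with their counts makes stray values such as "asfgd" stand out next to the terms used in thousands of records.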
If the WGBH staff member wanted to normalize the term “Moving Image” to something else, they would then click the “fix” link next to that term. This would open a window that would allow them to replace all instances of the term in that specific field with whatever term they deemed more appropriate. This is called applying a filter or a normalization.
With certain fields like Title or Contributor, where a predetermined vocabulary does not necessarily apply, a WGBH staff member reviewed all the entries used in that field and made individual decisions based on the patterns present in the data. Normalization work was left to the discretion of the WGBH staff member assigned to the station, as each station’s records vary greatly from one another. We did, however, provide normalization guidelines during our training sessions. One of the first rules was that, when in doubt, staff should not make changes to participant data. Instead of making changes, staff members raised any questions about the data in their reports. If the participant’s intention could be gauged with a reasonable degree of accuracy, however, we asked staff members to make changes. Changes might include correcting spelling errors, spelling out abbreviations, removing symbols, correcting spacing or formatting inconsistencies, and choosing name and title authorities based on the format used the most.
One benefit of normalizing data the way WGBH has is that the changes made to the data can be undone if an error is found. This is possible because staff members use filters to normalize the data instead of going into the record and changing the originally submitted data.
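Two of the changes listed above, correcting spacing and choosing an authority based on the format used the most, lend themselves to a short illustration. The sketch below is hypothetical: the helper names `tidy` and `pick_authority` and the sample entries are invented, and it assumes a staff member has already decided that the listed variants refer to the same person.

```python
from collections import Counter


def tidy(value):
    """Collapse runs of whitespace and trim both ends."""
    return " ".join(value.split())


def pick_authority(variants):
    """Return the most frequently used form among variants of one name.

    Variants are compared after their spacing has been tidied.
    """
    counts = Counter(tidy(v) for v in variants)
    return counts.most_common(1)[0][0]


# Hypothetical Contributor entries judged to be the same person
entries = ["Jane Doe", "Jane  Doe ", "Doe, Jane", "Jane Doe"]
authority = pick_authority(entries)
print(authority)
```

The judgment call of whether "Doe, Jane" and "Jane Doe" are the same contributor stays with the reviewer; the code only picks the dominant form once that decision is made.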
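The reason filters can be undone is that they sit alongside the submitted data rather than overwriting it. The following minimal sketch shows that idea under stated assumptions: the `Filter` class, the `apply_filters` function, and the single-value record layout are hypothetical and are not taken from the project's actual software.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Filter:
    field_path: str   # PBCore field the filter applies to
    original: str     # term as the station submitted it
    replacement: str  # normalized term


def apply_filters(record, filters):
    """Return a normalized copy of `record`; the input is not modified."""
    normalized = dict(record)
    for f in filters:
        if normalized.get(f.field_path) == f.original:
            normalized[f.field_path] = f.replacement
    return normalized


MEDIA_TYPE = "//pbcore:pbcoreInstantiation/pbcore:formatMediaType"
submitted = {MEDIA_TYPE: "moving image"}  # hypothetical submitted record
filters = [Filter(MEDIA_TYPE, "moving image", "Moving Image")]

with_filter = apply_filters(submitted, filters)

filters.pop()  # undo: discard the filter, the submitted data is untouched
without_filter = apply_filters(submitted, filters)
```

Because `submitted` is never edited, removing a filter is enough to restore the original view of the record.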
Normalization work was done for the following fields: Physical Format, Digital Format, Encoding, Asset Type, Identifier, Identifier Source, Duration, Subject, Coverage, Coverage Type, Creator, Contributor, Publisher, Description Type, Generations, Language, Creator Role, Contributor Role, Publisher Role, and Titles.
Recording The Changes: Normalization Excel Documents
WGBH employees performing normalizations recorded all the changes they made to participant data in an Excel document specific to each station. These documents can be accessed on the website on each station’s individual Inventory tab, below the familiar section labeled “Datasets.” There you will see a section labeled “For Station Reference.” You can download the document by clicking on the link labeled “Normalization Spreadsheet.” This will open a window that gives more details about the document, including a “Download” feature.
Once you’ve downloaded and opened the Excel file, you will see that the workbook has three sheets, accessible along the bottom of the screen. Each sheet (tab) describes data that WGBH has either normalized or thinks might need to be normalized. The tabs are: “Entered Online”, “Filter Work” & “Questions.”
The “Filter Work” and “Entered Online” tabs describe changes that WGBH staff have made to your data. For each change made, a row is used in the Excel spreadsheet. The row displays first the PBCore field name where the data appears, then the original data that we wish to change, the field name again, and then the normalized data.
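To make the row layout concrete, here is a small sketch that builds one such row and writes it out. It uses CSV from the Python standard library as a stand-in for Excel, and the helper name `change_row` and the sample values are hypothetical.

```python
import csv
import io


def change_row(field_name, original, normalized):
    """Build one row: field name, original data, field name again, normalized data."""
    return [field_name, original, field_name, normalized]


buffer = io.StringIO()
writer = csv.writer(buffer)

# Hypothetical change, not taken from a real station spreadsheet
writer.writerow(change_row("Physical Format", "betacam sp", "Betacam SP"))

output = buffer.getvalue()
print(output)
```

Repeating the field name before the normalized value keeps each half of the row readable on its own.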
The “Questions” tab lists all of the data that WGBH staff thought might be inaccurate or incomplete. WGBH wanted to bring these instances to the attention of the stations without making any changes that might be inappropriate.
CPB will contact participants in the near future with instructions on how to download a full copy of their normalized records.
We hope you are happy with the changes made to your data. We are extremely grateful to the many wonderful staff members who dedicated large chunks of their time to normalizing participant data!