SOUTH AFRICAN FAMILIES
Assembling archival information to reconstruct family lineages of the European settlers to South Africa from the seventeenth century to the present day allows for an investigation into long-term economic and demographic trends. Historical registries enable the study of the evolution of demographic and socio-economic outcomes across more than just two or three generations, answering questions relating to the inter-generational transmission of socio-economic status or about demographic processes such as fertility, migration, and marriage.
South African scholars are fortunate to benefit from the rich administrative records that are available in the Cape Archives in Cape Town. Historians and genealogists have, over the last century, worked to combine these into a single genealogical dataset of all settlers living in the eighteenth, nineteenth and early twentieth centuries. The dataset in question is one of very few in the world known to document a full population of immigrants and their families over several generations. The data were obtained from the Genealogical Institute of South Africa (GISA). GISA's genealogical registers include records of all known families that settled in South Africa and their descendants until 1910 and contain vital information on over half a million individuals over a period of 200 years.
The most recent edition of the genealogical registers, published by GISA in 2014, contains complete family registers of all settler families from 1652 to approximately 1830, as well as those of new progenitors of settler families up to 1867 for families with surnames starting with the letters L-Z, and up to 1910 for families with surnames starting with the letters A-K. The registers were compiled, inter alia, from baptism and marriage records of the Dutch Reformed Church archives in Cape Town; marriage documents of the courts of Cape Town, Graaff-Reinet, Tulbagh and Colesberg, collected from a card index in the Cape Archives Depot; death notices in the estate files of Cape Town and Bloemfontein; registers of the Reverends Archbell and Lindley; the Voortrekker baptismal register in the Dutch Reformed Church archive in Cape Town; the marriage register of the magistrate of Potchefstroom; and other notable genealogical publications, including C. C. de Villiers (1894) Geslacht-register der oude Kaapsche familiën; D. F. du Toit and T. Malherbe (1966) The Family register of the South African nation; J. A. Heese (1971) Die herkoms van die Afrikaner, 1657-1867; I. Mitford-Barberton (1968) Some frontier families; and various other genealogies on individual families.
I originally transcribed the SAF registers over a seven-month period in 2011. Because the genealogical records were compiled over several decades by dozens of researchers working from thousands of source documents, the PDF version available from GISA required extensive manipulation and cleaning. Some family lineages were compiled by GISA in Afrikaans and others in English, depending on the preference of the genealogist in question. For consistency, I converted all information to English. A rudimentary software programme was written to convert the PDF version into CSV format so that the data could be used in Excel or Stata. Owing to a number of inconsistencies in the original series, however, the conversion process required considerable intervention and post-transcription cleaning, and gender dummies had to be assigned manually to all individuals.
The final dataset contained information on the following variables: a unique individual ID, a household ID, a generation ID, and birth, baptism, marriage and death dates. During this initial phase of transcription, however, GISA undertook to revise and republish the registers, with the aim of correcting errors where possible and extending the series to contain complete family registers of all settler families up to 1930. As of January 2013, GISA had completed this revision process for families with surnames A-K, and the institute was kind enough to provide the revised and extended version of the genealogical records, not yet available to the public, for transcription.
A more sophisticated data transcription programme was created to transcribe the latest version of the registers so that more information could be harnessed from the primary data source. This process was completed in April 2013, and the new dataset contains the original set of variables as well as information on occupation (where available), geographic information for vital events, and spousal information, including birth, baptism and death dates and places as well as maiden names (where applicable) and parents' names. The inclusion of spousal information was critical for linking mothers to their children. Because the genealogies were compiled patrilineally, questions relating to female fertility could not be meaningfully answered without spousal information.
I created unique individual, family and mother identifier codes which allow offspring to be matched to both parents, so that families can be traced with relative ease over multiple generations. I concatenated genealogical codes to individuals' unique identifiers to indicate their relative position in their family tree. An identifier ending in a1 indicates that the individual was the patriarch of the family, that is, the first arrival in South Africa. If this individual had two children, their respective genealogical codes would be a1b1 and a1b2, and these siblings would share the same household identifier, a1b. All dates were converted to Stata internal format.
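The coding scheme described above can be sketched in a few lines of Python. This is only an illustration and not the programme used to build the SAF dataset: the function names are hypothetical, the code assumes that a genealogical code is a strict alternation of one lowercase generation letter and a birth-order number (as in a1b2), and it operates on the genealogical code alone rather than on the full individual identifier. The date helper assumes that "Stata internal format" refers to Stata's daily date encoding, which counts days since 1 January 1960.

```python
import re
from datetime import date

# Assumed shape of a genealogical code: generation letter + birth-order number,
# repeated once per generation, e.g. "a1b2c3". All names below are hypothetical.
_SEGMENT = re.compile(r"([a-z])(\d+)")


def parse_code(code):
    """Split 'a1b2' into [('a', '1'), ('b', '2')]; reject malformed codes."""
    segments = _SEGMENT.findall(code)
    rebuilt = "".join(letter + number for letter, number in segments)
    if not segments or rebuilt != code:
        raise ValueError(f"not a genealogical code: {code!r}")
    return segments


def generation(code):
    """Generation depth: 1 for the progenitor 'a1', 2 for 'a1b1', and so on."""
    return len(parse_code(code))


def household_id(code):
    """Drop the final birth-order number so siblings share an ID: 'a1b2' -> 'a1b'."""
    last_number = parse_code(code)[-1][1]
    return code[: -len(last_number)]


def parent_code(code):
    """Code of the parent in the patrilineal tree, or None for the progenitor."""
    segments = parse_code(code)[:-1]
    return "".join(letter + number for letter, number in segments) or None


def to_stata_daily(d):
    """Days elapsed since 1 January 1960 (assumed Stata daily date encoding)."""
    return (d - date(1960, 1, 1)).days
```

With these helpers, household_id("a1b1") and household_id("a1b2") both return "a1b", matching the sibling example in the text, and parent_code("a1b2") returns "a1", the progenitor.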
While the new A-K information substantially increases the sample size, it cannot be used without first addressing concerns about sample selection. Individuals with surnames starting with A-K were not found to differ systematically from those with surnames starting with L-Z: a two-sample t test with equal variances could not reject the null hypothesis that the difference between the two groups is equal to zero. Similarly, no significant difference was found for other variables of interest, including age at first marriage and net fertility. Including the revised and expanded A-K data in the full dataset is therefore unlikely to introduce additional bias into the sample. The absence of systematic differences between the two versions of the data, other than the increased sample size, suggests that any remaining errors can be attributed to the underlying data rather than to the transcription process.
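For readers unfamiliar with the test, the following is a minimal sketch of a two-sample t statistic with equal (pooled) variances, written with the Python standard library only. The function name is hypothetical and the numbers in the example are made-up placeholders, not values from the SAF data; the original comparison was presumably run in a statistical package such as Stata.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t_statistic(sample_a, sample_b):
    """Return (t, degrees_of_freedom) for a two-sample t test with equal variances.

    Null hypothesis: the two group means are equal.
    """
    n_a, n_b = len(sample_a), len(sample_b)
    df = n_a + n_b - 2
    # Pooled variance: the two sample variances weighted by their degrees of freedom.
    pooled_var = ((n_a - 1) * variance(sample_a) + (n_b - 1) * variance(sample_b)) / df
    standard_error = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(sample_a) - mean(sample_b)) / standard_error, df


# Placeholder values only (for example, ages at first marriage in each surname group).
surnames_a_to_k = [22, 24, 21, 25, 23]
surnames_l_to_z = [23, 22, 24, 21, 25]
t_stat, df = pooled_t_statistic(surnames_a_to_k, surnames_l_to_z)
```

The null hypothesis is rejected only when the absolute value of the t statistic exceeds the critical value of the t distribution for the given degrees of freedom; a statistic close to zero, as in a comparison where the groups do not differ, leaves the null standing.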
In 2016, GISA completed its revisions of the L-surname families, and these data have now been digitized, cleaned, and included in the latest release of the full dataset: SAF 2016 v.2.0.
Get access to this data
The SAF database is still under construction and it is not yet available for public use. However, if you are interested in gaining access, please send an email detailing your research proposal to this address.