Introduction to Structured Data

A great deal of the work in Digital Humanities is oriented around structured data and collections of structured data objects. Prospect is specifically geared to create, curate and visualize structured data collections, so if you want to use Prospect (or other similar platforms) to design and create your own project, you (or your project manager) will need to understand the nature of structured data.

“Structured data” refers to information that conforms to an organized and pre-defined pattern. You might think of it as a contract providing mutually agreed-upon standards between an entity providing data and an entity using the data. Some person or group has to come up with the design for this pattern, and designing usually means weighing a set of tradeoffs. There is seldom a single and exclusive “correct” solution.

Structured data is an extremely effective strategy for building and interpreting Digital Humanities projects because:

  • It requires you – the expert in a specific domain – to think through the implications of research on your subjects of enquiry and commit to a specific and explicit representation
  • It enables the digital platform to process, handle and visualize the information at a large scale
  • It facilitates the collection, creation and curation of the information in a form suitable for input and editing by multiple and simultaneous users (i.e., “crowd-sourcing,” student projects, etc.)
  • It stores the information efficiently and facilitates long-term data sustainability requirements

Example: American Presidents

Let’s say that you’re interested in creating a structured data collection about the first American presidents. If you wanted to create data for your project by assigning students the task of collecting data about these presidents, they could fill out information on a form, such as the following:

Name: ______________
Order (1-n): _________
Terms (0-2): _________
Party: [ ] Democratic [ ] Democratic-Republican [ ] Federalist [ ] Independent  [ ] Republican  [ ] Whig
Lifespan: ___________ / ___________
Birthplace— State: _______  Location: ___________  Lat-Lon: _____ , ______

By providing a form with specific questions, you ensure that all students are collecting the same information in a consistent manner. The forms for the first two presidents would look like this:

Name: George Washington
Order: 1
Terms: 2
Party: Independent
Lifespan: 1732-2-22/1799-12-14
Birthplace— State: VA  Location: Westmoreland Cty  Lat-Lon: 38.11,-76.8

Name: John Adams
Order: 2
Terms: 1
Party: Federalist
Lifespan: 1735-10-30/1826-7-4
Birthplace— State: MA  Location: Braintree  Lat-Lon: 42.206,-71.005

Data design is an acquired skill which can take time and experimentation to master, and there is no single “correct” answer for any universe of discourse. Exactly what information you include in a form depends on what data you wish to provide for the users of the project, what kinds of visualizations you wish to use, and so on. Data sits between the concepts and information inherent in a research project and the digital technologies used to represent and manipulate them. You cannot visualize data that you do not have.

You will need latitude-longitude coordinates in your data, for example, if you wish to display a map; in the forms above, the coordinates indicate the president’s place of birth. You will need chronological data – lifespans, or terms of office, or the like – to view your data on a timeline.
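To make this concrete, here is a minimal Python sketch of how the Lat-Lon and Lifespan values from the Washington form could be turned into numbers and dates that a map or timeline could use. It is an illustration only, not Prospect's own code: the function names are hypothetical, and the parsing rules (a comma between latitude and longitude, a slash between the two dates) are assumptions read off the sample forms above.

```python
from datetime import date


def parse_lat_lon(text):
    """Split a 'lat,lon' string into a pair of floats."""
    lat, lon = (float(part) for part in text.split(","))
    return lat, lon


def parse_lifespan(text):
    """Split a 'Y-M-D/Y-M-D' string into a (birth, death) pair of dates."""
    def to_date(piece):
        year, month, day = (int(n) for n in piece.split("-"))
        return date(year, month, day)

    birth, death = text.split("/")
    return to_date(birth), to_date(death)


# Values copied from the George Washington form above.
washington = {
    "Name": "George Washington",
    "Lifespan": "1732-2-22/1799-12-14",
    "Lat-Lon": "38.11,-76.8",
}

lat, lon = parse_lat_lon(washington["Lat-Lon"])
born, died = parse_lifespan(washington["Lifespan"])
print(lat, lon)      # a point that could be placed on a map
print(born, died)    # a span that could be drawn on a timeline
```

A form without the Lat-Lon or Lifespan entries would leave these functions with nothing to parse, which is the practical meaning of "you cannot visualize data that you do not have."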

Prospect Terminology

The empty form corresponds to a Template in Prospect: this is the structure you provide to define a certain kind of data object. There are corresponding terms and concepts in computer and information science: a “table” (in database jargon), an “object” (in object-oriented software paradigms), a “data schema” (in information science), etc.

Each “slot” of information in the Template is called an Attribute in Prospect. Attributes are the “atomic units” that make up Templates. In the “President” Template above, the Attributes are “Name,” “Order,” “Terms,” and so on. There are corresponding terms and concepts in computer and information science: a “field” (in database jargon), a “property” or “member” (in object-oriented programming), etc.

Every Attribute is an instance of a particular, singular data type which determines the format, meaning and use of the data value. For example, the “Order” Attribute is a number; the value “5” is a valid number but “Green” is not. A discussion of some of Prospect’s data types is below, and an exhaustive list, with notes about the format of the data, is given in the Prospect user manual.

When you fill out a Template with data, that “filled-out form” is called a Record. We have seen two Records above, one describing George Washington and another describing John Adams. Each Attribute has been given a specific data value, so that the abstract Template about presidents now describes one actual president.
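For readers who think in code, the relationship between Template, Attribute and Record can be sketched in Python roughly as follows. This is a loose analogy, not Prospect's internal representation: the class names, the check method and the use of a dictionary for a Record are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attribute:
    """One slot of information, with a name and a data type."""
    name: str
    data_type: str


@dataclass(frozen=True)
class Template:
    """The empty form: a named collection of Attributes."""
    name: str
    attributes: tuple

    def check(self, record):
        """Return (missing, unknown) Attribute names for a Record."""
        expected = {attribute.name for attribute in self.attributes}
        missing = sorted(expected - set(record))
        unknown = sorted(set(record) - expected)
        return missing, unknown


president_template = Template("President", (
    Attribute("Name", "Text"),
    Attribute("Order", "Number"),
    Attribute("Terms", "Number"),
    Attribute("Party", "Vocabulary"),
))

# A Record is the filled-out form: one value per Attribute.
washington = {
    "Name": "George Washington",
    "Order": "1",
    "Terms": "2",
    "Party": "Independent",
}

print(president_template.check(washington))
```

A Record that fills every Attribute and adds nothing extra yields two empty lists; anything else shows where the Record departs from the Template.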

Extending the Presidents Data Model

It is easy to see the limitations of this Template design: it is rather specific to the circumstances of the early American presidents.

If we wanted to extend it to the present day, we might want to add an Attribute to indicate the racial category of the president. If we wanted to anticipate future needs, we might want to add an Attribute to indicate the gender of the president.

If we wanted to create a structured data collection about Canadian Prime Ministers, some of the Attributes would remain the same but others would have to change. While we could keep the “Party” Attribute, for example, the specific set of options for political parties would have to change. We might also want to add an Attribute to indicate the languages that the leader speaks (given that there are two official national languages in Canada).
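The idea of reusing most of a Template while changing a few Attributes can be sketched in Python as below. Everything here is hypothetical: the dictionary layout and the derive_template helper are invented for illustration, and the list of Canadian parties is deliberately incomplete.

```python
# Each Attribute name maps to a data type, or to a list of options
# when the Attribute is a fixed vocabulary.
PRESIDENT_TEMPLATE = {
    "Name": "Text",
    "Order": "Number",
    "Terms": "Number",
    "Party": [
        "Democratic", "Democratic-Republican", "Federalist",
        "Independent", "Republican", "Whig",
    ],
}


def derive_template(base, changes):
    """Copy a template, overriding or adding the given Attributes."""
    derived = dict(base)      # the base template is left untouched
    derived.update(changes)
    return derived


PRIME_MINISTER_TEMPLATE = derive_template(PRESIDENT_TEMPLATE, {
    # Same Attribute, different options (illustrative, not a full list).
    "Party": ["Conservative", "Liberal"],
    # New Attribute for the two official national languages.
    "Languages": ["English", "French"],
})

print(sorted(PRIME_MINISTER_TEMPLATE))
```

The unchanged Attributes ("Name," "Order," "Terms") carry over as they are, while "Party" keeps its name but receives a new set of options.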

Attribute Data Types

Prospect implements a wide range of data types intended to support the many needs of humanities research. Some of the most common and useful are:

  • Vocabulary: short for “fixed vocabulary”; a pre-selected and limited set of discrete options, such as the political parties on the president Template
  • Text: a single textual entry without any limitations or format imposed upon it, such as a name or a description
  • Number: numerical values (currently only integers)
  • Dates: a single date or a date range in the format YYYY-MM-DD (the sample Records above write a range as two dates separated by a slash)
  • Lat-Lon: a latitude-longitude coordinate on a map, with the two numbers separated by a comma, such as “2.36772, 53.091”
  • Image: the URL to a JPG, PNG or GIF file
  • Link To: the URL to an external webpage on another webserver
  • Audio: the URL to an MP3 file or to a SoundCloud audio recording
  • YouTube: the code for a YouTube video (rather than the entire URL)

You can find further details about Attribute data types and the format to which data must conform in the Prospect user manual.
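As a rough illustration of what "conforming to a format" means, the following Python sketch checks values against four of the data types listed above. It is not Prospect's validation code. The function names are hypothetical, and several rules are assumptions: that a date range uses a slash between its two dates (as in the sample Records), that months and days may be written without leading zeros, and that latitude and longitude must fall within the usual geographic bounds. The user manual remains the authority on the exact formats.

```python
import re
from datetime import date

PARTIES = {
    "Democratic", "Democratic-Republican", "Federalist",
    "Independent", "Republican", "Whig",
}


def valid_vocabulary(value, options):
    """A Vocabulary value must be one of the pre-selected options."""
    return value in options


def valid_number(value):
    """A Number value must be an integer."""
    return re.fullmatch(r"-?\d+", value) is not None


def valid_single_date(value):
    """A single date must be year-month-day and exist on the calendar."""
    match = re.fullmatch(r"(\d{1,4})-(\d{1,2})-(\d{1,2})", value)
    if match is None:
        return False
    try:
        date(*(int(part) for part in match.groups()))
    except ValueError:   # e.g. month 13 or day 32
        return False
    return True


def valid_dates(value):
    """A Dates value is one date, or two dates separated by a slash."""
    parts = value.split("/")
    return len(parts) in (1, 2) and all(valid_single_date(p) for p in parts)


def valid_lat_lon(value):
    """A Lat-Lon value is two comma-separated numbers within bounds."""
    parts = value.split(",")
    if len(parts) != 2:
        return False
    try:
        lat, lon = float(parts[0]), float(parts[1])
    except ValueError:
        return False
    return -90 <= lat <= 90 and -180 <= lon <= 180


print(valid_number("5"), valid_number("Green"))
print(valid_dates("1732-2-22/1799-12-14"))
print(valid_lat_lon("2.36772, 53.091"))
print(valid_vocabulary("Whig", PARTIES))
```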


Legends

A Legend facilitates graphical representations of information by allowing Prospect to translate Attribute data values into colors. Legends are defined on an Attribute-by-Attribute basis. In other words, the person who defines the Attributes for your project in Prospect can define how the values of each Attribute should be translated into colors.

Only certain data types can have Legends:

  • Vocabulary: Each term in the Attribute’s fixed vocabulary can have its own color.
  • Text: The administrator can define text patterns against which text values can be matched; each pattern can have its own color.
  • Number: Each range of values can have its own label and color.
  • Dates: Each date range can have its own label and color.
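The translation from values to colors can be pictured with a small Python sketch. This is only a mental model, not how Prospect stores or applies Legends: the data layouts, function names, labels and color codes below are all invented for illustration.

```python
import re

# Vocabulary Legend: each term has its own color.
PARTY_LEGEND = {
    "Federalist": "#1f77b4",
    "Independent": "#7f7f7f",
}

# Number Legend: each range of values has a label and a color.
TERMS_LEGEND = [
    (0, 1, "One term or less", "#2ca02c"),
    (2, 2, "Two terms", "#d62728"),
]

# Text Legend: each pattern has its own color.
NAME_LEGEND = [
    (re.compile(r"^George\b"), "#9467bd"),
    (re.compile(r"^John\b"), "#ff7f0e"),
]


def color_for_number(value, legend):
    """Return (label, color) for the first range containing value."""
    for low, high, label, color in legend:
        if low <= value <= high:
            return label, color
    return None


def color_for_text(value, legend):
    """Return the color of the first pattern that matches value."""
    for pattern, color in legend:
        if pattern.search(value):
            return color
    return None


print(PARTY_LEGEND["Federalist"])
print(color_for_number(2, TERMS_LEGEND))
print(color_for_text("John Adams", NAME_LEGEND))
```

A Dates Legend would follow the same shape as the Number Legend, with date ranges in place of numeric ranges.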