Pepys’ Diary: Exported Data README

v1.1, 2011-01-24.

The Diary of Samuel Pepys features a new entry from the 17th century diarist every day, with accompanying background information.

A zip file of much of the data in JSON format, including this README, can always be found at: http://www.pepysdiary.com/export/json/pepysdiary_json.zip (around 6MB). The files include around 3,000 Diary Entries, over 4,000 Encyclopedia Topics, and more than 300 thumbnail portraits of people mentioned.

The data will change over time, with the zip file being updated at least once per week until sometime in 2012.

If you have any suggestions for improvements or additions, please do drop me a line: Phil Gyford, phil@gyford.com.

The Data

Each of the JSON files contains two top-level elements, meta and data. So each file's basic structure is like this:

{
    "meta":
    {
        "generated":"2011-01-16T15:33:32+00:00"
    },
    "data":
    [
        {
            ...
        },
        {
            ...
        }
    ]
}

The generated field is the time at which this file was generated, in ISO 8601 format.

The yearly Diary files also contain a year field which contains the year this file is for (ie, the file diary/1660.json has a year field with a value of 1660).

The data element contains an array of objects. Each object is a single entity of a particular type, ie, a single Diary entry, a single Category or a single Topic. We'll look at the data contained in each of these objects.

All fields are present for all data items, even those marked "optional". An optional field that has no value will be an empty string (if the field is a string) or null (if the field is a number).
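
As a rough sketch of how a file with this structure might be read in Python: the function names parse_export and has_value are made up for illustration and are not part of the export. It only relies on the meta, generated and data names described above.

```python
import json
from datetime import datetime

def parse_export(raw_json):
    """Split the JSON text of one exported file into (generated, items)."""
    doc = json.loads(raw_json)
    # "generated" is ISO 8601 with a UTC offset, eg 2011-01-16T15:33:32+00:00
    generated = datetime.fromisoformat(doc["meta"]["generated"])
    return generated, doc["data"]

def has_value(item, field):
    """True if a field is filled in, ie it is neither "" nor null."""
    return item[field] not in ("", None)
```

Because every field is always present, item[field] should not raise a KeyError; has_value only distinguishes filled-in values from the empty string and null.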

Diary

An example of one of the objects from the Diary JSON files' data element:

{
    "title": "Sunday 1 January 1659\/60",
    "date": "1660-01-01",
    "permalink": "http:\/\/www.pepysdiary.com\/archive\/1660\/01\/01\/",
    "comment_count": 52,
    "text": "<p>Blessed be God, at the end of the last year I was in very good health, without any sense of my old pain, but upon taking of cold.<sup>1<\/sup> I lived in <a href=\"http:\/\/www.pepysdiary.com\/p\/6919.php\">Axe Yard<\/a> having ... to our own home.<\/p>",
    "footnotes": "<p>The year did not legally begin ... Own TIme, book i.<\/li>\n<\/ol>"
}

Here's a description of each field:

title
(string) The title of the Diary entry, ie, the date in text format. Note that dates early in the year have a year of the form "1659/60" because the year didn't officially start until 25th March in England at the time. A date marked "1659/60" is treated as being in 1660 for the purposes of other fields, such as...
date
(string) The date of this Diary entry, in year-month-day format.
permalink
(string) The URL of this day's entry at pepysdiary.com.
comment_count
(number) The number of comments/annotations posted by users on this Diary entry on the website.
text
(string) The text of the Diary entry in HTML. Paragraphs are included. Each entry contains links to relevant pages in the Encyclopedia of the form http://www.pepysdiary.com/p/6919.php where the 6919 is the id of the Encyclopedia Topic (see below). Note that occasionally there are also links to other pages in the diary, of the same form as the permalink field. The only other HTML included is:
  • <sup>1</sup> or (in later entries) <sup id="fnr1-1665-01-16"><a href="#fn1-1665-01-16">1</a></sup> which point/link to footnotes (see the next field).
  • <i>l.</i>, italic tags around occurrences of l. s. d. (markers for pounds, shillings and pence).
footnotes
(string, optional) Footnotes for this entry, if any. Usually these are an ordered list (<ol>) of footnotes, but occasionally they have paragraphs as well or instead. Many have <a> links within them, often to Encyclopedia Topics. Some footnotes include <p>, and maybe <blockquote>, tags within the <li> tags.
Later entries may have footnotes with HTML ids and links back to the text, eg:
<li id="fn1-1665-01-16">Among the State Papers ...  1664-65, p. 122) <a href="#fnr1-1665-01-16">&#8617;</a></li>
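
Since the text field links to Encyclopedia Topics by id, one way to find which Topics an entry mentions is to pull the ids out of those links. This is only a sketch: the name topic_ids is made up, and it assumes the text has already been decoded from JSON (so "\/" has become "/") and that Topic links always have exactly the form shown above.

```python
import re

# Matches Encyclopedia links such as http://www.pepysdiary.com/p/6919.php
TOPIC_LINK = re.compile(r'href="http://www\.pepysdiary\.com/p/(\d+)\.php"')

def topic_ids(html):
    """Return the Topic ids linked from some entry HTML, in order, without repeats."""
    seen = []
    for match in TOPIC_LINK.finditer(html):
        topic_id = int(match.group(1))
        if topic_id not in seen:
            seen.append(topic_id)
    return seen
```

The same function could be run over the footnotes field, which often links to Topics too.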

Encyclopedia Categories

An example of one of the Categories from the encyclopedia/categories.json file's data element:

{
    "id": 10,
    "title": "Food",
    "parent_id": 173
}

Here's a description of each field:

id
(number) The unique identifier for this Category.
title
(string) The name of the Category.
parent_id
(number) The unique id of the parent of this Category. If the parent_id is 0 this is a top-level Category with no parent. Using this it should be possible to reconstruct the hierarchy of the Encyclopedia.
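
A minimal sketch of rebuilding the hierarchy from parent_id, assuming the list of Category objects has already been loaded from encyclopedia/categories.json. The function names build_children and category_path are made up for illustration.

```python
from collections import defaultdict

def build_children(categories):
    """Map each parent_id to the list of ids of its child Categories."""
    children = defaultdict(list)
    for cat in categories:
        children[cat["parent_id"]].append(cat["id"])
    return children

def category_path(categories, cat_id):
    """Return Category titles from the top-level ancestor down to cat_id."""
    by_id = {cat["id"]: cat for cat in categories}
    path = []
    while cat_id != 0:  # a parent_id of 0 marks a top-level Category
        cat = by_id[cat_id]
        path.append(cat["title"])
        cat_id = cat["parent_id"]
    return list(reversed(path))
```

build_children(categories)[0] gives the ids of the top-level Categories, which is a convenient starting point for walking the tree downwards.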

Encyclopedia Topics

Two examples of Topics from the encyclopedia/topics.json file's data element. Some of the fields only apply to certain kinds of Topic (although they are all always present). First, a person:

{
    "id": 114,
    "title": "Jemima Carteret (b. Mountagu, \"Mrs\/Lady Jem\")",
    "title_sort": "Carteret, Jemima (b. Mountagu, \"Mrs\/Lady Jem\")",
    "excerpt": "Daughter of Lord Sandwich, married Philip Carteret in 1665.",
    "text": "<p>Daughter of <a href=\"http:\/\/www.pepysdiary.com\/p\/112.php\">Lord Sandwich<\/a> ... <\/p>\n",
    "text_wheatley": "<p>Mrs. Jemimah, or Mrs. Jem, ... <\/p>\n",
    "published_date": "2002-12-27",
    "ping_count": 113,
    "comment_count": 5,
    "categories": [
        {
            "id": 2,
            "primary": true
        }
    ],
    "text_author": "Phil Gyford",
    "latitude": null,
    "longitude": null,
    "zoom": null,
    "shape": "",
    "map_category": "",
    "thumbnail_image": false,
    "wikipedia_page": ""
}

The second example is of a location:

{
    "id": 230,
    "title": "New Palace Yard",
    "title_sort": "New Palace Yard",
    "excerpt": "To the northwest of the Houses of Parliament ... ",
    "text": "",
    "text_wheatley": "",
    "published_date": "2003-01-28",
    "ping_count": 11,
    "comment_count": 5,
    "categories": [
        {
            "id": 28,
            "primary": true
        }
    ],
    "text_author": "Phil Gyford",
    "latitude": 51.500585069288,
    "longitude": -0.125532746315,
    "zoom": 15,
    "shape": "51.500856,-0.126257;51.500819,-0.124782;51.500188,-0.124916;51.500214,-0.12513;51.500234,-0.125157;51.500248,-0.125281;51.500441,-0.126064;51.500538,-0.126294;51.500638,-0.126498;51.500859,-0.126665;51.500856,-0.126257",
    "map_category": "road",
    "thumbnail_image": false,
    "wikipedia_page": "New_Palace_Yard"
}

Here's a description of each field:

id
(number) The unique identifier of this Topic, as used in links from the Diary entries.
title
(string) The name of this Topic.
title_sort
(string) The name of the Topic, in a form more suitable for sorting a list of Topics alphabetically. For many Topics title_sort will be the same as title. But for the names of people, title_sort will have their surname listed first, as in the example above: Carteret, Jemima (b. Mountagu, "Mrs/Lady Jem").
excerpt
(string, optional) A brief piece of plain text (no HTML) summarising the Topic. These are the texts in the pop-up boxes you see if you visit the website and mouse over a hyperlink within one of the Diary entries.
text
(string, optional) Some HTML text describing the Topic. If present, this can vary from a few words to a long essay.
text_wheatley
(string, optional) Some HTML text describing the Topic, taken from the footnotes of the 1893 edition of the Diary, written by Henry Wheatley.
published_date
(string) The date on which this Topic was first published on the website. As with Topic ids, these are broadly in the order in which the Topics appear in the diary, but shouldn't be relied on to be so.
ping_count
(number) The number of times this Topic has been linked to from the Diary. I'm not 100% sure of the accuracy of this, but it should be broadly correct.
comment_count
(number) The number of comments/annotations written by users about this Topic on the website.
categories
(array of objects) Each categories object has id (number) and primary (boolean) fields. The id field corresponds to the ids of Categories in the encyclopedia/categories.json file. The primary field indicates whether this is the primary Category for this Topic (although this has little real meaning/use). Each Topic should have at least one Category. I don't think any have more than two.
text_author
(string, optional) If there is anything written in the text field, this is the name of its author. If you display the text anywhere, please also display the name of its author.
latitude
(number, optional) Some of the Topics which are locations have latitude and longitude positions. If the location also has a shape (see below), the latitude and longitude indicate a roughly central point which you could, for example, center a map of the shape on.
longitude
(number, optional) See latitude, above.
zoom
(number, optional) If the Topic has latitude and longitude then the zoom field is set to a suitable value to use with Google Maps as an initial zoom level; eg, a map of a building in London will be zoomed in further than one of a city in another country.
shape
(string, optional) Some Topics which are locations describe an area or road. Some of these have shape data, which consist of a series of latitude and longitude points describing either a shape outline or a line on a map. Lat/lon are separated with commas, and each pair of points is separated by a semicolon. If the first and last pairs of points are identical, this is a closed outline (eg, a town square or area of a city); otherwise it is a line (eg, a road).
map_category
(string, optional) Some Topics which are locations have been assigned a map_category, which bears no relation to the overall Category hierarchy. This describes a small set of categories which locations can be divided into, for displaying separately on maps. The current possible values are:
area
A broad area within London, eg Covent Garden or St James's Park.
gate
One of the gates into and out of the old City of London, eg Temple Bar or Newgate.
home
One of the buildings in which Pepys lived.
misc
Something that doesn't fit into one of the other categories.
road
A road, street or square in London, eg Leadenhall Street or Spital Square.
stair
One of the landing stairs or docks on the banks of the River Thames, eg Tower Dock or Whitefriars Stairs.
town/village
A settlement outside of London, eg Marylebone. Note that what counted as "London" was a lot smaller in the 17th century.
Note that locations, shapes and these categories are a work in progress and are far from complete.
thumbnail_image
(boolean) Does this Topic have a thumbnail image included (see below)? This is only ever true for (some) people, not for Topics in any other category.
wikipedia_page
(string, optional) If this Topic has a relevant page on the English-language version of Wikipedia, the unique part of the page's URL is included here; eg, if the value of wikipedia_page is Church_of_St._Margaret%2C_Westminster the Wikipedia page is at http://en.wikipedia.org/wiki/Church_of_St._Margaret%2C_Westminster.
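
The shape and wikipedia_page fields both need a little unpacking before use. Here is a hedged sketch of how that might look in Python, following the format described above; the names parse_shape and wikipedia_url are made up for illustration.

```python
def parse_shape(shape):
    """Turn a shape string into ([(lat, lon), ...], closed)."""
    if not shape:
        return [], False  # optional field: "" means no shape data
    points = []
    for pair in shape.split(";"):   # pairs are separated by semicolons
        lat, lon = pair.split(",")  # lat/lon are separated by a comma
        points.append((float(lat), float(lon)))
    # Identical first and last points mean a closed outline, otherwise a line.
    closed = len(points) > 1 and points[0] == points[-1]
    return points, closed

def wikipedia_url(topic):
    """Return the English Wikipedia URL for a Topic, or None if it has no page."""
    page = topic["wikipedia_page"]
    return "http://en.wikipedia.org/wiki/" + page if page else None
```

The wikipedia_page value is used as-is because, as in the Church_of_St._Margaret%2C_Westminster example, it is already URL-encoded.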

Encyclopedia Thumbnails

Also included is a directory of several hundred JPEG images, each one a small portrait of a person who has a Topic in the Encyclopedia. Each image is named like 112.jpg, where 112 corresponds to the id of the Topic. Every Topic that has a thumbnail_image value of true should have a corresponding thumbnail file. All the images are 100 x 120 pixels in size, and are taken from images on Wikipedia.
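
A small sketch of finding the image for a Topic. The function name thumbnail_path and the thumbnails_dir parameter are assumptions; pass in wherever the thumbnail directory ends up after you unzip the export.

```python
import os

def thumbnail_path(topic, thumbnails_dir):
    """Return the expected image path for a Topic, or None if it has no thumbnail."""
    if not topic["thumbnail_image"]:
        return None
    # Images are named after the Topic id, eg 112.jpg
    return os.path.join(thumbnails_dir, "%d.jpg" % topic["id"])
```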

Licence

The text of the Diary itself (but not the links within the text) comes from Project Gutenberg and is in the public domain. Minor typos have occasionally been fixed.

The thumbnail images all come from Wikipedia and are also considered public domain.

In both the above cases you're expected to check your own country's copyright laws to ensure this applies...

Everything else is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike (CC BY-NC-SA 3.0) licence.

All non-public-domain elements are copyright Phil Gyford, 2002-2011, except Encyclopedia texts which remain the copyright of their respective authors.

If you pass any of the accompanying data files on, please be sure to include this README document.

Versions

v1.0, 2011-01-22
First release.
v1.1, 2011-01-24
Corrected typo in 'New Palace Yard' example in README.