Exploring The Metropolitan Museum of Art's Open Access Initiative dataset

Last week New York’s Metropolitan Museum of Art, or “the Met”, announced that it has released a selection of its digital catalog. The Met is the largest art museum in the United States, with a collection of over 2 million works, and we now have access to more than 420,000 of them.

The dataset they’ve released is currently only available as a CSV file encoded in UTF-8, but it is unrestricted for commercial and noncommercial use under the CC0 license. You can find more details about the dataset and the license over at their GitHub repository.

Get the data

The first thing you’ll notice when you clone the repository from GitHub is that you don’t actually get the CSV file you expect, but this:

~$ ls -l
total 20
-rw-r--r-- 1 avitalp avitalp 6555 Feb 13 23:35 LICENSE
-rw-r--r-- 1 avitalp avitalp  134 Feb 13 23:35 MetObjects.csv
-rw-r--r-- 1 avitalp avitalp 4255 Feb 13 23:35 README.md

Notice that the file size is only 134 bytes, obviously too small. What gives? Opening it up, you see:

version https://git-lfs.github.com/spec/v1
oid sha256:8f525b729844dc8df16ad18986fa8b11220a22d7a6a6892d24f766620cdebc75
size 225665513

Well, GitHub uses Git Large File Storage (Git LFS), an extension to Git that lets you version large files, up to 2 GB per file on GitHub.

You’ll need to grab the latest version of the extension for your system (v1.5.5 as of this writing). Be aware that Git LFS requires git version >= 1.8.2, so if you’re running an older Ubuntu or Debian release, you may need to upgrade git first.
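
You can check which version of git you have with:

~$ git --version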

I’m running Debian Jessie so I just grabbed the package and installed it:

~$ wget -O git-lfs_1.5.5_amd64.deb https://packagecloud.io/github/git-lfs/packages/debian/jessie/git-lfs_1.5.5_amd64.deb/download
~$ sudo dpkg -i git-lfs_1.5.5_amd64.deb && sudo apt-get install -f

Verify that the installation was successful:

~$ git lfs install
Updated pre-push hook.
Git LFS initialized.

Then do:

~$ git lfs fetch
Fetching master
Git LFS: (1 of 1 files) 214.66 MB / 214.66 MB

When you list the contents of the folder, you’ll see the 225M file has now been downloaded:

~$ ls -l
total 220052
-rw-r--r-- 1 avitalp avitalp      6555 Feb  8 14:13 LICENSE
-rw-r--r-- 1 avitalp avitalp 225088425 Feb  8 15:37 MetObjects.csv
-rw-r--r-- 1 avitalp avitalp      4255 Feb  8 14:13 README.md

If the above doesn’t work for you, have a look at GitHub issue #325. Alternatively, you can download the file directly with:

~$ wget https://media.githubusercontent.com/media/metmuseum/openaccess/master/MetObjects.csv

Once you have the file, its first row gives you the field names:

Object Number,Is Highlight,Is Public Domain,Object ID,Department,Object Name,Title,Culture,Period,Dynasty,Reign,Portfolio,Artist Role,Artist Prefix,Artist Display Name,Artist Display Bio,Artist Suffix,Artist Alpha Sort,Artist Nationality,Artist Begin Date,Artist End Date,Object Date,Object Begin Date,Object End Date,Medium,Dimensions,Credit Line,Geography Type,City,State,County,Country,Region,Subregion,Locale,Locus,Excavation,River,Classification,Rights and Reproduction,Link Resource,Metadata Date,Repository

Many of these fields are self-explanatory, but some have ambiguous or non-obvious meanings. The data looks like it spans multiple tables in the Met’s database, with the relational links flattened to generate this file. Some obviously useful bits of data are also missing, such as collection details and links to the original full-size images. We’ll have to wait for the Met to release more definitive documentation, but for now this is a good start.
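
If you just want a quick peek at the file before committing to a database, Python’s built-in csv module is enough. A minimal sketch (Python 3, assuming MetObjects.csv is in the current directory):

#!/usr/bin/env python3
import csv

# Print the column names and count the data rows.
with open('MetObjects.csv', newline='', encoding='utf-8') as f:
    reader = csv.reader(f)
    header = next(reader)
    print(len(header), 'columns:', ', '.join(header[:5]), '...')
    print(sum(1 for _ in reader), 'data rows')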

Prepare the database

I had initially planned to just use Python’s csv module, but changed my mind once I realised the size of the dataset. Besides, I figured exploring and manipulating the data would be easier in SQL, so I decided to create a simple table in my MySQL database:

-- DROP DATABASE IF EXISTS themet;
-- CREATE DATABASE themet DEFAULT CHARACTER SET utf8 COLLATE utf8_unicode_ci;
USE themet;

CREATE TABLE `met_objects`
(
    `obj_num` text,
    `is_highlight` boolean,
    `is_public_domain` boolean,
    `obj_id` int(11),
    `department` text,
    `obj_name` text,
    `title` text,
    `culture` text,
    `period` text,
    `dynasty` text,
    `reign` text,
    `portfolio` text,
    `artist_role` text,
    `artist_prefix` text,
    `artist_display_name` text,
    `artist_display_bio` text,
    `artist_suffix` text,
    `artist_alpha_sort` text,
    `artist_nationality` text,
    `artist_begin_date` text,
    `artist_end_date` text,
    `obj_date` text,
    `obj_begin_date` text,
    `obj_end_date` text,
    `medium` text,
    `dimensions` text,
    `credit_line` text,
    `geography_type` text,
    `city` text,
    `state` text,
    `county` text,
    `country` text,
    `region` text,
    `subregion` text,
    `locale` text,
    `locus` text,
    `excavation` text,
    `river` text,
    `classification` text,
    `rights_and_reproduction` text,
    `link_resource` text,
    `metadata_date` text,
    `repository` text,
    `full_img` text,
    `collection_details` text,
     PRIMARY KEY (`obj_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE utf8_unicode_ci;

The SQL above creates a table called “met_objects” in the “themet” database (uncomment the first two lines if you still need to create the database). As you can see, the table has a very liberal structure, treating essentially every field as free-form text. The only three exceptions are the object ID, which acts as the primary key, and two boolean fields, one of which indicates whether an item is under CC0. I also added the last two fields, for the collection details and the URL of the full-size image, since I plan to scrape that information separately (see below).

I added indices to the table on two obviously useful fields:

CREATE INDEX by_artist_display_name ON met_objects(`artist_display_name`(255));
CREATE INDEX by_title ON met_objects(`title`(255));

Keep in mind that MySQL can only index a prefix of a TEXT column, so you must supply a prefix length so it can determine the key length. With InnoDB, an index key prefix is limited to 767 bytes, which for utf8 (up to 3 bytes per character) works out to 255 characters, so you won’t be able to create a longer prefix than that. For these two fields, 255 characters is plenty, but if you need the whole column to be matchable through an index, you’ll have to do something cleverer, like calculating an SHA1 or MD5 hash of the value and indexing that.
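
If you did need to match on the full text of a column, one option is to index a hash of the value. This is just a sketch (the title_md5 column name is only for illustration): it requires MySQL 5.7+ for generated columns and only helps with exact-match lookups.

ALTER TABLE met_objects
    ADD COLUMN title_md5 CHAR(32) AS (MD5(`title`)) STORED,
    ADD INDEX by_title_md5 (`title_md5`);

-- Exact-match lookups can then go through the hash column:
SELECT obj_id, title FROM met_objects WHERE title_md5 = MD5('Apple Blossoms');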

There are many good candidates for indices in this table, and which ones matter will depend on your use case, so add single- or multiple-column indexes on the data that’s of interest to you. It would also be interesting to see someone load this dataset into Elasticsearch.

Prepare the data

Next, I converted the file to Unix line endings and replaced the strings “False” and “True” with 0 and 1 respectively, so the boolean fields import without MySQL warnings (note that the sed below is a blunt global replace, not restricted to those two columns):

~$ sed -i -e 's/\r$//' -e 's/False/0/g' -e 's/True/1/g' MetObjects.csv

Import the data

Finally, I tried a good ol’ LOAD DATA INFILE from the MySQL console:

mysql> LOAD DATA LOCAL INFILE 'MetObjects.csv' INTO TABLE met_objects CHARACTER SET 'UTF8' FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' LINES TERMINATED BY '\n' IGNORE 1 LINES;

Query OK, 446029 rows affected, 1 warning (9.07 sec)
Records: 446029  Deleted: 0  Skipped: 0  Warnings: 1

I was expecting more issues, but to my surprise the only warning was “Row 148766 doesn’t contain data for all columns”, which I decided to ignore. Browsing the data after import, I still see a lot of empty or whitespace-only values where NULL should probably be, but I’ve decided to leave them in their original form - it would be trivial to trim or clean these in SQL. There may be other issues I haven’t found yet, but so far, so good.
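
For example, blanking out whitespace-only values in a single column could be as simple as the statement below (a sketch - repeat for whichever columns you care about):

-- Replace empty or whitespace-only values with NULL, one column at a time.
UPDATE met_objects SET culture = NULL WHERE TRIM(culture) = '';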

A quick query shows 200,074 objects tagged as public domain, with an impressive array of almost 800 distinct classifications. You can find an Excel file with a table showing all the classifications here, but here are the top 10:

+-------+------------------------+
| total | classification         |
+-------+------------------------+
|  ...  | ...                    |
|  3273 | Photographs|Ephemera   |
|  3587 | Ceramics-Porcelain     |
|  6373 | Photographs            |
|  6639 | Glass                  |
|  6805 | Paintings              |
|  7862 | Textiles-Woven         |
| 11279 | Ceramics               |
| 14558 | Drawings               |
| 19081 |                        |
| 23299 | Prints                 |
+-------+------------------------+
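
For reference, the numbers above come from queries along these lines (the table was simply sorted the other way round):

-- Objects flagged as public domain
SELECT COUNT(*) FROM met_objects WHERE is_public_domain = 1;

-- Top 10 classifications by number of objects
SELECT COUNT(*) AS total, classification
FROM met_objects
GROUP BY classification
ORDER BY total DESC
LIMIT 10;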

Interestingly, about 19,000 objects have no classification at all, possibly due to some import error - something I’ll have to explore in more depth at a later date and then update the GitHub repo where you can find all this stuff. In any event, I was mostly interested in the 6,805 paintings, as I plan to display them on some custom-made digital frames.

As I mentioned before, the dataset is missing links to the original full-size images as well as collection information, so I decided to scrape both. My first attempt involved regular expressions, which was easy enough for finding links with a specific format, but then this issue made me think there might be more worthwhile data to grab later, so I switched to BeautifulSoup with the intent of scraping more data items in the future.

The super simple scraper code follows below and works on Python 2.7+. You’ll need to install requests, BeautifulSoup, and PyMySQL for it to work; all are very popular and actively maintained libraries, so you shouldn’t have any major issues installing them.
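
If you use pip, something like this should pull in all three (the PyPI package for BeautifulSoup is beautifulsoup4):

~$ pip install requests beautifulsoup4 pymysql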

#!/usr/bin/env python
# -*- coding: utf-8 -*-
#
# A simple web scraper to grab additional information
# missing from the Met Museum's Open Access Initiative
# dataset.
#
# Copyright 2017 Avital Pekker.
# Author: Avital Pekker (hello@avital.ca)
#
# For license information, see LICENSE.TXT
#

import pymysql.cursors
import requests
from bs4 import BeautifulSoup
import time
import sys

PY3 = (sys.version_info > (3, 0))

DB_HOST = "localhost"
DB_NAME = "themet"
DB_USER = "PUT_USERNAME_HERE"
DB_PASS = "PUT_PASSWORD_HERE"
DB_TABLE = "met_objects"

# Only public domain paintings that have an artist name
DB_FIELDS              = "obj_id, obj_num, title, artist_display_name, obj_begin_date, link_resource, full_img, collection_details"
DB_FILTER              = "is_public_domain = 1 AND LENGTH(artist_display_name) > 0 AND classification='Paintings'"
DB_MAX_RESULTS         = 500
DB_SKIP_POPULATED_IMG  = True

connection = pymysql.connect(host=DB_HOST,
                             user=DB_USER,
                             password=DB_PASS,
                             db=DB_NAME,
                             charset='utf8mb4',
                             cursorclass=pymysql.cursors.DictCursor)
                             
try:
    with connection.cursor() as cursor:
        sql = "SELECT {0} FROM {1} WHERE {2} LIMIT {3}".format(DB_FIELDS, DB_TABLE, DB_FILTER, DB_MAX_RESULTS)
        cursor.execute(sql)
        db_matches = cursor.fetchall()

        update_sql = "UPDATE {0} SET full_img = %s, collection_details = %s WHERE obj_id = %s".format(DB_TABLE)

        for row in db_matches:
            if row['full_img'] is not None and DB_SKIP_POPULATED_IMG:
                if PY3:
                    title = row['title']
                else:
                    title = row['title'].encode('utf-8')

                print('Object "{0}" already in DB, skipping...'.format(title))
                continue
            
            response = requests.get(row['link_resource'])
            soup = BeautifulSoup(response.text, "html.parser")
            
            for link in soup.find_all('a', href=True):
                if PY3:
                    attrib_avail = getattr(link.text, 'attr', str) is str
                else:
                    attrib_avail = getattr(link.text, 'attr', unicode) is unicode
                    
                if attrib_avail and link.text.strip() == "Download":
                        # Example link: 
                        # href="{{selectedOrDefaultDownload('http://images.metmuseum.org/CRDImages/ad/original/25592.jpg')}}"
                        
                        # Strip the text around the link
                        clean_link = link['href'][29:-4]
                        
                        # Some items have additional collection info
                        details_matches = soup.find_all('div', {'class': 'collection-details__label'})
                        clean_details = None
                        
                        # find_all() returns a (possibly empty) list, so check
                        # it is non-empty before indexing into it.
                        if details_matches:
                            clean_details = details_matches[0].text.strip()
                        
                        details_found = (clean_details is not None and len(clean_details) > 0)

                        if PY3:
                            title = row['title']
                        else:
                            title = row['title'].encode('utf-8')
                            
                        print('Object "{0}", original full image: {1}; details found: {2}' \
                            .format(title, clean_link, details_found))
                                
                        cursor.execute(update_sql, (clean_link, clean_details, row['obj_id']))
            
            # Slow down - please don't bombard the site with requests.
            # Ideally, copy my database.
            time.sleep(3)
            
        # Save to DB - PyMySQL disables autocommit by default.
        connection.commit()
finally:
    connection.close()

You should see output such as:

Object "Plaque of George Washington", original full image: http://images.metmuseum.org/CRDImages/ad/original/25534.jpg; details found: False
Object "Apple Blossoms", original full image: http://images.metmuseum.org/CRDImages/ap/original/APS1730.jpg; details found: True

I ran the scraper over the past week and populated the database. You can find my dump of this data and the scraper at the GitHub repo: avitalp/metmuseum-oa-explore

Once you get the SQL file, you can restore it with (make sure you create the database first):

mysql -u USER -p themet < met_objects.sql
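
If the database doesn’t exist yet, you can create it first with the same settings used in the schema above:

mysql -u USER -p -e "CREATE DATABASE themet DEFAULT CHARACTER SET utf8 COLLATE utf8_unicode_ci;"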

That’s it! Hope you find this useful and do something interesting with it.

Use this in a cool project? Spotted an error? Say hi, let me know! Drop me a line at hello@avital.ca
