Testing Pig scripts that use Avro

Writing unit tests for Apache Pig is easy with PigUnit, however writing tests for scripts that load data from an Avro file requires a little more work. Those impatient to see a working example can check out my demonstration project.

Script under test

The script we want to test simply filters out people who are older than 24 from an Avro file and then saves the results back to an Avro file. Nothing too extraneous.

people = LOAD '$DATA_INPUT' USING AvroStorage();

people_over_24 = FILTER people BY age > 24;

STORE people_over_24 INTO '$DATA_OUTPUT' USING AvroStorage();

The test

The not-so-obvious way of providing the test data is to instantiate the PigTest class with the location of your Avro data-file (the data contained within it is ignored). At runtime PigUnit will use the schema defined in the Avro file to override the LOAD statement with the schema, defined using the AS keyword. You can see this in action by examining the console’s output when running the test.

STORE people_over_24 INTO 'output' USING AvroStorage();
--> none
people: {name: chararray,age: int}
people = LOAD '/var/folders/9n/cj1d204d6559wrszmp760xc80000gn/T/pig-avro-test8589720558697077822.avro' USING AvroStorage();
--> people = LOAD 'file:/tmp/temp-1468326526/tmp-822087986input_data.1471440361913.1' USING PigStorage(',') AS (
    name: chararray,
    age: int
STORE people_over_24 INTO 'output' USING AvroStorage();
--> none

The test data can then be passed into the PigTest class’ assertOutput, as the code segment from my demonstration project shows…

public class PigAvroTest {

    private static final String PIG_SCRIPT = PigAvroTest.class.getResource("filter_people_over_24.pig").getPath();

    // ...

    public void setUp() throws IOException {
        File dataFile = File.createTempFile("pig-avro-test", ".avro");
        writeAvroSchema(dataFile, Person.getClassSchema());

        final String[] arguments = {"DATA_INPUT=" + dataFile.toURI(), "DATA_OUTPUT=output"};
        pigTest = new PigTest(PIG_SCRIPT, arguments);

    public void return_25_year_old_when_provided_24_and_25_year_old() throws Throwable {
        final String[] input = {

        final String[] output = {

        pigTest.assertOutput(INPUT_DATA, input, FILTERED_DATA, output, DELIMITER);

    private static void writeAvroSchema(File file, Schema schema) throws IOException {
        DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<GenericRecord>(schema);
        DataFileWriter<GenericRecord> dataFileWriter = new DataFileWriter<GenericRecord>(datumWriter);
        dataFileWriter.create(schema, file);