Joins – Part 2

Today we will see how to work with joins in Pig Latin script. In my previous post “Joins – Part 1” we have seen What are joins?
 
First we load datasets “person” and “employees” into relations person and employees

-- load person
person = load  '/tmp/input/person.csv'
         using PigStorage(',')
            as (EmpID: chararray, Name: chararray, Birthdate: chararray, Birthplace: chararray,
                BirthCountry: chararray, BirthState: chararray, Country: chararray, Address1: chararray,
                Address2: chararray, Address3: chararray, Address4: chararray, City: chararray,
                County: chararray, State: chararray, Postal: chararray, Phone: chararray,
                EmailAddress: chararray);

-- load employees
employees = load  '/tmp/input/employees.csv'
            using PigStorage(',')
               as (EmpID: chararray, OrgRelation: chararray, Gender: chararray, MaritalStatus: chararray,
                   HireDate: chararray, TerminationDate: chararray, ReportsTo: chararray, SupervisorID: chararray,
                   DeptID: chararray, JobCode: chararray, RegTemp: chararray, FullPart: chararray, Company: chararray,
                   Paygroup: chararray, CompensationFrequency: chararray, CompensationRate: long,
                   AnnualRate: long, JobTitle: chararray);

 
1. Inner Join

-- inner join
innerJoin = join person By EmpID, employees By EmpID;

innerSet = foreach  innerJoin
           generate person::EmpID, Name, Birthdate, BirthCountry, Phone, EmailAddress,
                    Gender, MaritalStatus, JobTitle, AnnualRate, HireDate, TerminationDate;

store innerSet into '/tmp/output/innerJoin' using PigStorage('|');

Inner join output resultset —

ID|Name|Birthdate|Country|Phone|EmailAddress|Gender|MaritalStatus|JobTitle||StartDate|TerminationDate
KC0001|Martina Griffiths|1956-03-31|CAN|9256947901|HCMGENUser1@ap6023fems.us.oracle.com|F|M|Secretary-Senior|31500|1990-01-01|
KC0003|Cynthia Rogers|1950-07-11||9256947903|HCMGENUser1@ap6023fems.us.oracle.com|F|S|Clerk-Accounting|23010|1988-03-12|
KC0015|Barry Campbell|1950-12-31|CAN|9256947915|HCMGENUser1@ap6023fems.us.oracle.com|M|S|Specialist-Training|43200|1984-11-02|2000-01-28
KC0016|Paulette Vaillancourt|1929-12-10||9256947916|HCMGENUser1@ap6023fems.us.oracle.com|F|M|Analyst-Human Resources|52320|1981-03-20|1999-12-31
KC0018|Helen Johnson|1957-10-11||9256947918|HCMGENUser1@ap6023fems.us.oracle.com|F|S|Specialist-Benefits|50400|1981-10-25|
KWG126|Johann Moser|1950-04-15|CHE|49891430|HCMGENUser1@ap6023fems.us.oracle.com|M|M|Cleaner|20000|1965-03-01|2005-12-31
KWG127|Anita Zahnd|1966-05-23|CHE|49891430|HCMGENUser1@ap6023fems.us.oracle.com|F|S|Researcher|110000|1985-02-01|2005-12-31
KWG128|Susette Racine|1946-08-26|CHE|49891430|HCMGENUser1@ap6023fems.us.oracle.com|F|M|Other Job2|20000|1960-04-01|2005-12-31

 
2. Full Outer Join

-- full outer join
fullOuterJoin = join person By EmpID FULL OUTER, employees By EmpID;

fullOuterSet = foreach  fullOuterJoin {
                  EmplID = ((employees::EmpID is NULL) ? person::EmpID : employees::EmpID);
                  generate EmplID, Name, Birthdate, BirthCountry, Phone, EmailAddress, Gender,
                           MaritalStatus, JobTitle, AnnualRate, HireDate, TerminationDate;
                        };

store fullOuterSet into '/tmp/output/fullOuterJoin' using PigStorage('|');

Full outer join output resultset —

ID|Name|Birthdate|Country|Phone|EmailAddress|Gender|MaritalStatus|JobTitle||StartDate|TerminationDate
KC0001|Martina Griffiths|1956-03-31|CAN|9256947901|HCMGENUser1@ap6023fems.us.oracle.com|F|M|Secretary-Senior|31500|1990-01-01|
KC0003|Cynthia Rogers|1950-07-11||9256947903|HCMGENUser1@ap6023fems.us.oracle.com|F|S|Clerk-Accounting|23010|1988-03-12|
KC0015|Barry Campbell|1950-12-31|CAN|9256947915|HCMGENUser1@ap6023fems.us.oracle.com|M|S|Specialist-Training|43200|1984-11-02|2000-01-28
SFTRN137||||||U|S|Teacher's Assistant|42000|2000-01-01|
SFTRN147||||||U|S|Teacher's Assistant|42000|2000-01-01|
SFTRN157||||||U|S|Teacher's Assistant|42000|2000-01-01|
ADCRM1001|Savana Miller|1986-08-21||8185559231|HCMGENUser1@ap6023fems.us.oracle.com||||||
ADCRM1007|Timothy Franzen|1964-03-13||8185559090|HCMGENUser1@ap6023fems.us.oracle.com||||||
ADCRM1008|Rebecca Irvin|1966-04-22||8185551199|HCMGENUser1@ap6023fems.us.oracle.com||||||

 
3. Left Outer Join

-- left outer join
leftOuterJoin = join person By EmpID LEFT OUTER, employees By EmpID;

leftOuterSet = foreach  leftOuterJoin {
                  EmplID = ((employees::EmpID is NULL) ? person::EmpID : employees::EmpID);
                  generate EmplID, Name, Birthdate, BirthCountry, Phone, EmailAddress, Gender,
                           MaritalStatus, JobTitle, AnnualRate, HireDate, TerminationDate;
                        };

store leftOuterSet into '/tmp/output/leftOuterJoin' using PigStorage('|');

Left outer join output resultset —

ID|Name|Birthdate|Country|Phone|EmailAddress|Gender|MaritalStatus|JobTitle||StartDate|TerminationDate
KC0001|Martina Griffiths|1956-03-31|CAN|9256947901|HCMGENUser1@ap6023fems.us.oracle.com|F|M|Secretary-Senior|31500|1990-01-01|
KC0003|Cynthia Rogers|1950-07-11||9256947903|HCMGENUser1@ap6023fems.us.oracle.com|F|S|Clerk-Accounting|23010|1988-03-12|
KC0015|Barry Campbell|1950-12-31|CAN|9256947915|HCMGENUser1@ap6023fems.us.oracle.com|M|S|Specialist-Training|43200|1984-11-02|2000-01-28
ADCRM1001|Savana Miller|1986-08-21||8185559231|HCMGENUser1@ap6023fems.us.oracle.com||||||
ADCRM1007|Timothy Franzen|1964-03-13||8185559090|HCMGENUser1@ap6023fems.us.oracle.com||||||
ADCRM1008|Rebecca Irvin|1966-04-22||8185551199|HCMGENUser1@ap6023fems.us.oracle.com||||||

 
4. Right Outer Join

-- right outer join
rightOuterJoin = join person By EmpID RIGHT OUTER, employees By EmpID;

rightOuterSet = foreach  leftOuterJoin {
                  EmplID = ((employees::EmpID is NULL) ? person::EmpID : employees::EmpID);
                  generate EmplID, Name, Birthdate, BirthCountry, Phone, EmailAddress, Gender,
                           MaritalStatus, JobTitle, AnnualRate, HireDate, TerminationDate;
                        };

store rightOuterSet into '/tmp/output/rightOuterJoin' using PigStorage('|');

Right outer join output resultset —

ID|Name|Birthdate|Country|Phone|EmailAddress|Gender|MaritalStatus|JobTitle||StartDate|TerminationDate
KC0001|Martina Griffiths|1956-03-31|CAN|9256947901|HCMGENUser1@ap6023fems.us.oracle.com|F|M|Secretary-Senior|31500|1990-01-01|
KC0003|Cynthia Rogers|1950-07-11||9256947903|HCMGENUser1@ap6023fems.us.oracle.com|F|S|Clerk-Accounting|23010|1988-03-12|
KC0015|Barry Campbell|1950-12-31|CAN|9256947915|HCMGENUser1@ap6023fems.us.oracle.com|M|S|Specialist-Training|43200|1984-11-02|2000-01-28
SFTRN137||||||U|S|Teacher's Assistant|42000|2000-01-01|
SFTRN147||||||U|S|Teacher's Assistant|42000|2000-01-01|
SFTRN157||||||U|S|Teacher's Assistant|42000|2000-01-01|

 
For more on Pig commands, functions and UDFs follow Qubole Pig Cheatsheet.
 
Stay tuned we will have updates on Hive joins in our next post.

Leave a Reply

Your email address will not be published. Required fields are marked *