Extracting Reporting Data with Python
Overview
If you like the idea of using the Seismic reporting APIs to get data into files or a database, but aren't quite sure how to go about doing that, the script below may be able to help!
The Python script below can extract data from any of the reporting APIs and save that data into JSON files or a SQL database. The script also provides a useful foundation to extend for other custom data extraction needs.
This script is intended to be used with the password authentication flow or the refresh token flow.
Before You Begin
Python and pip setup
To run this script you will need Python installed on your machine. We recommend Python 3.7.1 or later, but the script also works with Python 2.7.x.
You can download the latest version of Python from https://www.python.org/downloads/
In addition to Python, we strongly recommend you have pip, the Python package manager. pip is bundled with Python 3.4 and later (and with Python 2.7.9 and later); on older versions you may need to install it manually from https://pypi.org/project/pip/
We also recommend occasionally updating pip using python -m pip install --upgrade pip
Installing dependencies
Once you have Python and pip installed on your machine, install the script's one third-party dependency, pymssql (only required if you plan to export to SQL). The other modules the script imports (urllib, json, argparse, and datetime) are part of the standard library in both Python 2.7 and Python 3, so they do not need to be installed separately.
For both Python 2 and Python 3, run the following:
pip install pymssql
Usage
This script has a set of constants which need to be set directly in the script, as described in the table below.
Constant | Description |
---|---|
TENANT | The name of your tenant. If your Seismic url is acme.seismic.com, then acme is your tenant. |
CLIENT_ID | Your client_id for a password flow client. See Get Started for more info. |
CLIENT_SECRET | Your client_secret for a password flow client. See Get Started for more info. |
TENANT_USERNAME | The username of a user in your tenant you will use to run the APIs |
TENANT_PASSWORD | The direct login password for the user (note, this is not the same as the SSO password). This user typically requires direct login privileges in the tenant. Contact your customer success team for assistance if needed. |
REFRESH_TOKEN | If you would like to use a refresh token flow rather than password flow, provide your refresh token here. Note, you will need a multi-use refresh token. Leave this field blank if you want to use password flow. |
SQL_SERVER | The host for your SQL server if you want to replicate the data to a SQL server. Note, this script assumes the destination is a Microsoft SQL server using port 1433, although the script can easily be modified for use with other forms of SQL servers. |
SQL_DB | The name of the database to add tables to |
SQL_USERNAME | The username of the user in the SQL server |
SQL_PASSWORD | The password of the user in the SQL server |
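As an illustration, the constants at the top of the script might look like the following once filled in. Every value here is a placeholder (the acme tenant, credentials, and server names are invented); substitute your own details:

```python
# Placeholder configuration -- every value below is invented for illustration.
TENANT = 'acme'                       # from acme.seismic.com
CLIENT_ID = 'your-client-id'
CLIENT_SECRET = 'your-client-secret'
TENANT_USERNAME = 'reporting.user@acme.com'
TENANT_PASSWORD = 'direct-login-password'
REFRESH_TOKEN = ''                    # leave blank to use the password flow
SQL_SERVER = 'sqlserver.example.com'  # only used with --sql
SQL_DB = 'SeismicReporting'
SQL_USERNAME = 'sql-user'
SQL_PASSWORD = 'sql-password'
```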
Once these constants are added to the Python script, the script can be executed from the command line in a tool such as Windows PowerShell.
The typical way to run the script is with the following command: python myscriptname.py --all --json
which will save all reports to your local folder in JSON format.
We also support the following arguments if you would like a bit more control over what the script runs:
Command Line Argument | Description |
---|---|
-r or --reports | A comma separated list of reports to run from any of the reports listed in the reporting APIs. This option should not be used if --all is used. |
--all | Runs all reports and saves the data to individual JSON or SQL tables |
-s or --startdate | The start date for the report such as 2017-01-01. Exclude this parameter if you want all data for the reports. |
-e or --enddate | The end date for the report such as 2019-01-01. Exclude this parameter if you want all data for the reports. |
--sql | Include this parameter if you want data exported to SQL |
--json | Include this parameter if you want data exported to JSON |
--csv | Include this parameter if you want data exported to CSV |
--prefix | Optional prefix to add to JSON filename or SQL table names |
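For example, assuming the script is saved as myscriptname.py, a few typical invocations might look like this (the report names, dates, and prefix are illustrative):

```shell
# All reports, saved as JSON files in the current folder
python myscriptname.py --all --json

# Two specific reports, limited to 2018, exported to CSV
python myscriptname.py --reports contents,users -s 2018-01-01 -e 2018-12-31 --csv

# All reports loaded into SQL tables prefixed with "seismic_"
python myscriptname.py --all --sql --prefix seismic_
```

Note that --csv cannot be combined with --json or --sql in a single run.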
Script
- The main() function contains the primary logic for the script: parsing arguments, authenticating, calling the APIs, and saving the data to JSON and SQL.
- The getDataFromAPI() function contains the code for extracting data from the APIs, including the use and propagation of the Continuation header.
- The getBearerToken() function exchanges credentials for a bearer token.
- The remainder of the functions in the script are largely supporting functions.
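The Continuation pattern described above can be sketched in isolation before reading the full listing. The fetch_page callable below is a hypothetical stand-in for the real HTTP request; in the actual script the loop is driven by the Continuation response header, which is present while more data remains:

```python
# Minimal sketch of Continuation-style paging.
# fetch_page(token) stands in for an HTTP call and returns
# (records, next_token); an empty token means no more pages.
def get_all_pages(fetch_page):
    data = []
    token = ''
    while True:
        page, token = fetch_page(token)
        data.extend(page)
        if not token:  # no Continuation header -> all data retrieved
            return data

# Stub illustrating the shape: three pages, then no continuation token.
def fake_fetch(token):
    pages = {'': ([1, 2], 'tok1'), 'tok1': ([3], 'tok2'), 'tok2': ([4], '')}
    return pages[token]

print(get_all_pages(fake_fetch))  # [1, 2, 3, 4]
```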
from __future__ import print_function
import os, sys
import urllib
import json
from datetime import datetime
import time
import argparse
try:  # Try Python 3
    from urllib.request import urlopen, Request
    from urllib.error import HTTPError
    from urllib.parse import urlencode, quote_plus
    unicode = str  # Python 3 has no separate unicode type; cleanStr() relies on this alias
except ImportError:  # Fall back to Python 2
    from urllib2 import urlopen, Request, HTTPError
    from urllib import urlencode
TENANT=''
CLIENT_ID=''
CLIENT_SECRET=''
TENANT_USERNAME=''
TENANT_PASSWORD=''
REFRESH_TOKEN='' # Set this only if you want to use a refresh_token flow instead of password flow
SQL_SERVER=''
SQL_DB=''
SQL_USERNAME=''
SQL_PASSWORD=''
reportList = [
'contentProfileAssignments',
'contentProfiles',
'contentProperties',
'contentPropertyAssignments',
'contentUsageHistory',
'contentViewHistory',
'contents',
'externalUsers',
'generatedLivedocComponents',
'generatedLivedocFields',
'generatedLivedocOutputFormats',
'generatedLivedocSlides',
'generatedLivedocs',
'groupMembers',
'groups',
'libraryContentVersions',
'libraryContents',
'livesendLinkContents',
'livesendLinkMembers',
'livesendLinks',
'livesendPageViews',
'livesendViewingSessions',
'searchHistory',
'searchWords',
'teamsites',
'userActivity',
'userProperties',
'userPropertyAssignments',
'users',
'workspaceContentVersions',
'workspaceContents'
]
###################################################################################
############ MAIN FUNCTION TO AUTH, RUN API, SAVE DATA TO JSON + SQL ############
###################################################################################
def main():
args = parse_args()
############## LOGIN USING DIRECT LOGON CREDENTIALS TO GET AN ACCESS TOKEN ##############
bearer = getBearerToken(TENANT)
############## CONNECT TO SQL IF SPECIFIED ##############
    if(args.sql):
        try:
            global pymssql  # make pymssql visible to execSql(), which catches pymssql.DatabaseError
            import pymssql
        except ImportError:
            print("You do not have pymssql installed. Please try to run the following command: pip install pymssql")
            quit()
try:
dbCon = pymssql.connect(server=SQL_SERVER, database=SQL_DB, user=SQL_USERNAME, password=SQL_PASSWORD, tds_version='8.0', port='1433', charset="UTF-8")
dbCur = dbCon.cursor()
except Exception as e:
            print('Error trying to connect to database. SERVER:' + SQL_SERVER + ' DATABASE:' + SQL_DB + ' USER:' + SQL_USERNAME)  # password intentionally not printed
print(str(e))
quit()
############## GET THE LIST OF REPORTS TO RUN ##############
try:
if(args.all):
reports = reportList
else:
reports = args.reports.split(',')
    except Exception as e:
        print('You must specify a comma separated list of reports using --reports or specify --all to run all reports')
        quit()
############## RUN EACH OF THE REPORTS ##############
for report in reports:
url = 'https://api.seismic.com/reporting/v2/' + report + '?occuredAtStartTime=' + args.startdate + '&occuredAtEndTime=' + args.enddate
print( "Getting data from url: " + url)
if(args.csv):
format = 'csv'
else:
format = 'json'
data = getDataFromAPI(url, bearer, '', format)
print('')
        if(args.json):
            print('Creating json file ' + args.prefix + report + '.json')
            dataFile = open(args.prefix + report + '.json', "w")
            json.dump(data, dataFile, indent=4)
            dataFile.close()
        if(args.csv):
            print('Creating CSV file ' + args.prefix + report + '.csv')
            dataFile = open(args.prefix + report + '.csv', "wb")  # binary mode so the UTF-8 bytes write cleanly on both Python 2 and 3
            dataFile.write(data.encode('utf-8'))
            dataFile.close()
if(args.sql):
############## GET THE LIST OF FIELDS IN THE DATA ##############
tableName = args.prefix + report
print('Creating SQL table ' + tableName)
fields = {}
for record in data:
for key in record:
if(type(record[key]) is not dict and key not in fields):
fields[key] = type(record[key]).__name__
                        if(fields[key] in ('unicode', 'str')):  # 'str' on Python 3, 'unicode' on Python 2
### Try to parse it as a date, if it passes, it is a datetime field. There is a possibility this incorrectly classifies a field as a date, but it is quite low.
try:
date = datetime.strptime(record[key][0:10], '%Y-%m-%d')
fields[key] = 'datetime'
except Exception as e:
pass
############## DROP & CREATE THE TABLE IN SQL ##############
sql = 'DROP TABLE IF EXISTS ' + tableName
execSql(dbCur, dbCon, sql)
sqlFieldListString = ','.join('[' + k + ']' for k in fields)
sqlFieldDefinitionString = ','.join('[' + k + '] ' + pythonToSqlType(fields[k]) for k in fields)
sql = 'CREATE TABLE ' + tableName + ' ('
sql += sqlFieldDefinitionString
sql += ')'
execSql(dbCur, dbCon, sql)
############## ADD THE DATA TO THE TABLE ##############
            print('Adding data to SQL . . .', end='')
i=0
for record in data:
fieldValueList = []
for field in fields:
try:
fieldValueList.append("'" + cleanStr(record[field]) + "'")
except Exception as e:
# If the field is not found in the given record, put a null value in SQL
fieldValueList.append('null')
pass
fieldValueListString = ','.join(fieldValueList)
sql = 'INSERT INTO ' + tableName + ' ( ' + sqlFieldListString + ' ) '
sql += ' VALUES (' + fieldValueListString + ')'
i = i + 1
if(i == 100):
i = 0
print (".", end="")
sys.stdout.flush()
execSql(dbCur, dbCon, sql)
print("\n\n")
###################################################################################
################ FUNCTION TO GET DATA FROM API WITH CONTINUATION ################
###################################################################################
def getDataFromAPI(url, authorization, continuation, format):
trial = 10
while(trial >= 0):
trial = trial - 1
        if(trial <= 0):
            print('Unable to get data from ' + url)
            raise Exception('Retries exhausted getting data from ' + url)
else:
try:
req = Request(url)
req.add_header('Python-Script-Version', '1.1')
if(authorization != ''):
req.add_header('Authorization', authorization)
if(continuation != ''):
req.add_header('Continuation', continuation)
if(format == 'csv'):
req.add_header('Accept', 'text/csv')
response = urlopen(req)
if(response.getcode() == 200):
print (".", end="")
sys.stdout.flush()
if(format == 'json'):
data = json.loads(response.read().decode('utf-8'))
else:
data = (response.read().decode('utf-8'))
                    headers = response.info()
                    try:
                        # If there is a Continuation header, there is more data, so call this again to get the rest
                        continuationHeader = headers['Continuation']
                        if continuationHeader:  # Python 3 returns None (instead of raising) for a missing header
                            data = data + getDataFromAPI(url, authorization, continuationHeader, format)
                    except Exception as e:
                        pass
else:
data = None
return data
except HTTPError as e:
print( 'Failure(' + str(e.code) + ') on attempt ' + str(10-trial) + ' to get data from ' + url)
continue
###################################################################################
######################### FUNCTION TO RUN A SQL COMMAND #########################
###################################################################################
def execSql(cursor, conn, sql):
trial = 10
    while(trial >= 0):
        if(trial == 0):
            print('Major error trying to reconnect to database')
            quit()
        trial = trial - 1
try:
cursor.execute(sql)
conn.commit()
return
except pymssql.DatabaseError as e:
try:
print( 'Error with SQL: ' + sql)
print( str(e))
print( 'Attempting to reconnect to database')
time.sleep(3)
dbCon = pymssql.connect(server=SQL_SERVER, user=SQL_USERNAME, password=SQL_PASSWORD, database=SQL_DB, tds_version='8.0', port='1433', charset="UTF-8")
print( 'connected')
dbCur = dbCon.cursor()
conn = dbCon
cursor = dbCur
print( 'continuing')
except Exception as e:
print( 'Failed to reconnect')
print( str(e))
pass
pass
####################################################################################
#### FUNCTION TO EXCHANGE USERNAME+PASSWORD OR REFRESH TOKEN FOR BEARER TOKEN ####
####################################################################################
def getBearerToken(tenant):
if(REFRESH_TOKEN != ''):
try:
url = 'https://auth.seismic.com/tenants/' + tenant + '/connect/token'
data = {'client_id': CLIENT_ID,
'client_secret' : CLIENT_SECRET,
'grant_type' : 'refresh_token',
'refresh_token' : REFRESH_TOKEN
}
response = urlopen(url, urlencode(data).encode("utf-8"))
d = json.loads(response.read().decode('utf-8'))
return 'Bearer ' + d['access_token']
except HTTPError as e:
print(str(e))
print(e.read())
quit()
except Exception as e:
print('Unable to get bearer token from tenant: [' + tenant + '] using refresh token flow')
print(str(e))
quit()
else:
try:
url = 'https://auth.seismic.com/tenants/' + tenant + '/connect/token'
data = {'client_id': CLIENT_ID,
'client_secret' : CLIENT_SECRET,
'grant_type' : 'password',
'scope' : 'seismic.reporting',
'username' : TENANT_USERNAME,
'password' : TENANT_PASSWORD
}
response = urlopen(url, urlencode(data).encode("utf-8"))
d = json.loads(response.read().decode('utf-8'))
return 'Bearer ' + d['access_token']
except HTTPError as e:
print(str(e))
print(e.read())
quit()
except Exception as e:
print('Unable to get bearer token from tenant: [' + tenant + '] using password flow')
print(str(e))
quit()
def parse_args():
parser = argparse.ArgumentParser(
description='Get data from the Seismic reporting APIs into JSON, CSV and/or SQL. An example to get all data into both SQL and JSON would be getSeismicData.py --all --sql --json',
)
parser.add_argument(
'-r',
'--reports',
help='A comma separated list of reports such as ' + ','.join(reportList[0:2]) + '. Note, these are case sensitive',
type=str
)
parser.add_argument(
'-s',
'--startdate',
help='The start date for the report such as 2017-01-01',
default='',
type=str
)
parser.add_argument(
'-e',
'--enddate',
help='The end date for the report such as 2019-01-01',
default='',
type=str
)
parser.add_argument(
'--prefix',
help='Optional prefix to add to JSON filename or SQL table names',
default='',
type=str
)
parser.add_argument(
'--all',
action='store_true',
help='Runs all reports to individual JSON or SQL tables'
)
parser.add_argument(
'--sql',
action='store_true',
help='Output data to SQL'
)
parser.add_argument(
'--json',
action='store_true',
help='Output data to JSON'
)
parser.add_argument(
'--csv',
action='store_true',
        help='Output data to CSV'
)
args = parser.parse_args()
if(args.sql == False and args.json == False and args.csv == False):
print( 'Please specify either --sql or --json or --csv as output formats')
quit()
if((args.sql == True or args.json == True) and args.csv == True):
print('You cannot export to CSV at the same time as JSON or SQL')
quit()
    return args
def cleanStr(s):
if s is None:
return ''
elif type(s) == bool:
return str(s)
elif type(s) == list:
if(len(s) > 0):
return '|' + cleanStr('|'.join(s)) + '|'
else:
return 'null'
    return unicode(s).replace(u"\u2018", "'").replace(u"\u00e2\u20ac\u2122", "'").replace(u"\u2019", "'").replace("'", "''").replace("&", "\\&").encode('ascii', 'ignore').decode('ascii').strip()
def pythonToSqlType(t):
    # Map Python type names to SQL Server column types
    # ('str' appears on Python 3, 'unicode' on Python 2)
    return {
        'unicode': 'nvarchar(max)',
        'str': 'nvarchar(max)',
        'int': 'bigint',
        'float': 'float',
        'list': 'nvarchar(max)',
        'bool': 'bit',
        'NoneType': 'nvarchar(max)',
        'datetime': 'datetime'
    }[t]
def module_exists(module_name):
try:
__import__(module_name)
except ImportError:
return False
else:
return True
if __name__ == "__main__":
main()