Extracting Reporting Data with Python
Overview
If you like the idea of using the Seismic reporting APIs to get data into files or a database, but aren't quite sure how to go about doing that, the script below may be able to help!
The Python script below can extract data from any of the reporting APIs and save that data into JSON files or a SQL database. The script also provides a useful foundation to extend for other custom data extraction needs.
This script is intended to be used with the password authentication flow or the refresh token flow.
Before You Begin
Python and pip setup
To run this script you will need Python installed on your machine. We recommend Python 3.7.1 or later, but the script also works with Python 2.7.x.
You can download the latest version of Python from https://www.python.org/downloads/
In addition to Python, we strongly recommend you have pip, the Python package manager. pip is bundled with Python 3.4 and later (and with Python 2.7.9 and later); on older versions you may need to install it manually from https://pypi.org/project/pip/
We also recommend occasionally updating pip using python -m pip install --upgrade pip
Installing dependencies
Once you have Python and pip installed on your machine, install the script's one third-party dependency, pymssql (only required if you plan to export to SQL). The other modules the script imports (urllib, json, argparse, and datetime) are part of the standard library in both Python 2.7 and Python 3, so they do not need to be installed separately.
For both Python 2 and Python 3, run the following:
pip install pymssql
Usage
This script has a set of constants which need to be set directly in the script, as described in the table below.
Constant | Description |
---|---|
TENANT | The name of your tenant. If your Seismic url is acme.seismic.com, then acme is your tenant. |
CLIENT_ID | Your client_id for a password flow client. See Get Started for more info. |
CLIENT_SECRET | Your client_secret for a password flow client. See Get Started for more info. |
TENANT_USERNAME | The username of a user in your tenant you will use to run the APIs |
TENANT_PASSWORD | The direct login password for the user (note, this is not the same as the SSO password). This user typically requires direct login privileges in the tenant. Contact your customer success team for assistance if needed. |
REFRESH_TOKEN | If you would like to use a refresh token flow rather than password flow, provide your refresh token here. Note, you will need a multi-use refresh token. Leave this field blank if you want to use password flow. |
SQL_SERVER | The host for your SQL server if you want to replicate the data to a SQL server. Note, this script assumes the destination is a Microsoft SQL server using port 1433, although the script can easily be modified for use with other forms of SQL servers. |
SQL_DB | The name of the database to add tables to |
SQL_USERNAME | The username of the user in the SQL server |
SQL_PASSWORD | The password of the user in the SQL server |
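As an illustration, the constants at the top of the script might look like the following once filled in. Every value here is a placeholder (the acme tenant, credentials, and server names are invented); substitute your own details:

```python
# Placeholder configuration -- every value below is invented for illustration.
TENANT = 'acme'                       # from acme.seismic.com
CLIENT_ID = 'your-client-id'
CLIENT_SECRET = 'your-client-secret'
TENANT_USERNAME = 'reporting.user@acme.com'
TENANT_PASSWORD = 'direct-login-password'
REFRESH_TOKEN = ''                    # leave blank to use the password flow
SQL_SERVER = 'sqlserver.example.com'  # only used with --sql
SQL_DB = 'SeismicReporting'
SQL_USERNAME = 'sql-user'
SQL_PASSWORD = 'sql-password'
```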
Once these constants are added to the Python script, the script can be executed from the command line in a tool such as Windows PowerShell.
The typical way to run the script is with the following command: python myscriptname.py --all --json
which will save all reports to your local folder in JSON format.
We also support the following arguments if you would like a bit more control over what the script runs:
Command Line Argument | Description |
---|---|
-r or --reports | A comma separated list of reports to run from any of the reports listed in the reporting APIs. This option should not be used if --all is used. |
--all | Runs all reports and saves the data to individual JSON or SQL tables |
-s or --startdate | The start date for the report such as 2017-01-01. Exclude this parameter if you want all data for the reports. |
-e or --enddate | The end date for the report such as 2019-01-01. Exclude this parameter if you want all data for the reports. |
--sql | Include this parameter if you want data exported to SQL |
--json | Include this parameter if you want data exported to JSON |
--csv | Include this parameter if you want data exported to CSV |
--prefix | Optional prefix to add to JSON filename or SQL table names |
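For example, assuming the script is saved as myscriptname.py, a few typical invocations might look like this (the report names, dates, and prefix are illustrative):

```shell
# All reports, saved as JSON files in the current folder
python myscriptname.py --all --json

# Two specific reports, limited to 2018, exported to CSV
python myscriptname.py --reports contents,users -s 2018-01-01 -e 2018-12-31 --csv

# All reports loaded into SQL tables prefixed with "seismic_"
python myscriptname.py --all --sql --prefix seismic_
```

Note that --csv cannot be combined with --json or --sql in a single run.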
Script
- The main() function contains the primary logic for the script: parsing arguments, authenticating, calling the APIs, and saving the data to JSON and SQL.
- The getDataFromAPI() function contains the code for extracting data from the APIs, including the use and propagation of the Continuation header.
- The getBearerToken() function exchanges credentials for a bearer token.
- The remainder of the functions in the script are largely supporting functions.
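The Continuation pattern described above can be sketched in isolation before reading the full listing. The fetch_page callable below is a hypothetical stand-in for the real HTTP request; in the actual script the loop is driven by the Continuation response header, which is present while more data remains:

```python
# Minimal sketch of Continuation-style paging.
# fetch_page(token) stands in for an HTTP call and returns
# (records, next_token); an empty token means no more pages.
def get_all_pages(fetch_page):
    data = []
    token = ''
    while True:
        page, token = fetch_page(token)
        data.extend(page)
        if not token:  # no Continuation header -> all data retrieved
            return data

# Stub illustrating the shape: three pages, then no continuation token.
def fake_fetch(token):
    pages = {'': ([1, 2], 'tok1'), 'tok1': ([3], 'tok2'), 'tok2': ([4], '')}
    return pages[token]

print(get_all_pages(fake_fetch))  # [1, 2, 3, 4]
```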
from __future__ import print_function
import os, sys
import urllib
import json
from datetime import datetime
import time
import argparse
try:  # Try Python 3
    from urllib.request import urlopen, Request
    from urllib.error import HTTPError
    from urllib.parse import urlencode, quote_plus
    unicode = str  # Python 3 has no separate unicode type; cleanStr() relies on this alias
except ImportError:  # Fall back to Python 2
    from urllib2 import urlopen, Request, HTTPError
    from urllib import urlencode
TENANT=''
CLIENT_ID=''
CLIENT_SECRET=''
TENANT_USERNAME=''
TENANT_PASSWORD=''
REFRESH_TOKEN='' # Set this only if you want to use a refresh_token flow instead of password flow
SQL_SERVER=''
SQL_DB=''
SQL_USERNAME=''
SQL_PASSWORD=''
reportList = [
'contentProfileAssignments',
'contentProfiles',
'contentProperties',
'contentPropertyAssignments',
'contentUsageHistory',
'contentViewHistory',
'contents',
'externalUsers',
'generatedLivedocComponents',
'generatedLivedocFields',
'generatedLivedocOutputFormats',
'generatedLivedocSlides',
'generatedLivedocs',
'groupMembers',
'groups',
'libraryContentVersions',
'libraryContents',
'livesendLinkContents',
'livesendLinkMembers',
'livesendLinks',
'livesendPageViews',
'livesendViewingSessions',
'searchHistory',
'searchWords',
'teamsites',
'userActivity',
'userProperties',
'userPropertyAssignments',
'users',
'workspaceContentVersions',
'workspaceContents'
]
###################################################################################
############ MAIN FUNCTION TO AUTH, RUN API, SAVE DATA TO JSON + SQL ############
###################################################################################
def main():
args = parse_args()
############## LOGIN USING DIRECT LOGON CREDENTIALS TO GET AN ACCESS TOKEN ##############
bearer = getBearerToken(TENANT)
############## CONNECT TO SQL IF SPECIFIED ##############
    if(args.sql):
        try:
            global pymssql  # make pymssql visible to execSql(), which catches pymssql.DatabaseError
            import pymssql
        except ImportError:
            print("You do not have pymssql installed. Please try to run the following command: pip install pymssql")
            quit()
try:
dbCon = pymssql.connect(server=SQL_SERVER, database=SQL_DB, user=SQL_USERNAME, password=SQL_PASSWORD, tds_version='8.0', port='1433', charset="UTF-8")
dbCur = dbCon.cursor()
except Exception as e:
            print('Error trying to connect to database. SERVER:' + SQL_SERVER + ' DATABASE:' + SQL_DB + ' USER:' + SQL_USERNAME)  # password intentionally not printed
print(str(e))
quit()
############## GET THE LIST OF REPORTS TO RUN ##############
try:
if(args.all):
reports = reportList
else:
reports = args.reports.split(',')
    except Exception as e:
        print('You must specify a comma separated list of reports using --reports or specify --all to run all reports')
        quit()
############## RUN EACH OF THE REPORTS ##############
for report in reports:
url = 'https://api.seismic.com/reporting/v2/' + report + '?occuredAtStartTime=' + args.startdate + '&occuredAtEndTime=' + args.enddate
print( "Getting data from url: " + url)
if(args.csv):
format = 'csv'
else:
format = 'json'
data = getDataFromAPI(url, bearer, '', format)
print('')
        if(args.json):
            print('Creating json file ' + args.prefix + report + '.json')
            dataFile = open(args.prefix + report + '.json', "w")
            json.dump(data, dataFile, indent=4)
            dataFile.close()
        if(args.csv):
            print('Creating CSV file ' + args.prefix + report + '.csv')
            dataFile = open(args.prefix + report + '.csv', "wb")  # binary mode so the UTF-8 bytes write cleanly on both Python 2 and 3
            dataFile.write(data.encode('utf-8'))
            dataFile.close()
if(args.sql):
############## GET THE LIST OF FIELDS IN THE DATA ##############
tableName = args.prefix + report
print('Creating SQL table ' + tableName)
fields = {}
for record in data:
for key in record:
if(type(record[key]) is not dict and key not in fields):
fields[key] = type(record[key]).__name__
                        if(fields[key] in ('unicode', 'str')):  # 'str' on Python 3, 'unicode' on Python 2
### Try to parse it as a date, if it passes, it is a datetime field. There is a possibility this incorrectly classifies a field as a date, but it is quite low.
try:
date = datetime.strptime(record[key][0:10], '%Y-%m-%d')
fields[key] = 'datetime'
except Exception as e:
pass
############## DROP & CREATE THE TABLE IN SQL ##############
sql = 'DROP TABLE IF EXISTS ' + tableName
execSql(dbCur, dbCon, sql)
sqlFieldListString = ','.join('[' + k + ']' for k in fields)
sqlFieldDefinitionString = ','.join('[' + k + '] ' + pythonToSqlType(fields[k]) for k in fields)
sql = 'CREATE TABLE ' + tableName + ' ('
sql += sqlFieldDefinitionString
sql += ')'
execSql(dbCur, dbCon, sql)
############## ADD THE DATA TO THE TABLE ##############
            print('Adding data to SQL . . .', end='')
i=0
for record in data:
fieldValueList = []
for field in fields:
try:
fieldValueList.append("'" + cleanStr(record[field]) + "'")
except Exception as e:
# If the field is not found in the given record, put a null value in SQL
fieldValueList.append('null')
pass
fieldValueListString = ','.join(fieldValueList)
sql = 'INSERT INTO ' + tableName + ' ( ' + sqlFieldListString + ' ) '
sql += ' VALUES (' + fieldValueListString + ')'
i = i + 1
if(i == 100):
i = 0
print (".", end="")
sys.stdout.flush()
execSql(dbCur, dbCon, sql)
print("\n\n")
###################################################################################
################ FUNCTION TO GET DATA FROM API WITH CONTINUATION ################
###################################################################################
def getDataFromAPI(url, authorization, continuation, format):
trial = 10
while(trial >= 0):
trial = trial - 1
        if(trial <= 0):
            print('Unable to get data from ' + url)
            raise Exception('Retries exhausted getting data from ' + url)
else:
try:
req = Request(url)
req.add_header('Python-Script-Version', '1.1')
if(authorization != ''):
req.add_header('Authorization', authorization)
if(continuation != ''):
req.add_header('Continuation', continuation)
if(format == 'csv'):
req.add_header('Accept', 'text/csv')
response = urlopen(req)
if(response.getcode() == 200):
print (".", end="")
sys.stdout.flush()
if(format == 'json'):
data = json.loads(response.read().decode('utf-8'))
else:
data = (response.read().decode('utf-8'))
                    headers = response.info()
                    try:
                        # If there is a Continuation header, there is more data, so call this again to get the rest
                        continuationHeader = headers['Continuation']
                        if continuationHeader:  # Python 3 returns None (instead of raising) for a missing header
                            data = data + getDataFromAPI(url, authorization, continuationHeader, format)
                    except Exception as e:
                        pass
else:
data = None
return data
except HTTPError as e:
print( 'Failure(' + str(e.code) + ') on attempt ' + str(10-trial) + ' to get data from ' + url)
continue
###################################################################################
######################### FUNCTION TO RUN A SQL COMMAND #########################
###################################################################################
def execSql(cursor, conn, sql):
trial = 10
    while(trial >= 0):
        if(trial == 0):
            print('Major error trying to reconnect to database')
            quit()
        trial = trial - 1
try:
cursor.execute(sql)
conn.commit()
return
except pymssql.DatabaseError as e:
try:
print( 'Error with SQL: ' + sql)
print( str(e))
print( 'Attempting to reconnect to database')
time.sleep(3)
dbCon = pymssql.connect(server=SQL_SERVER, user=SQL_USERNAME, password=SQL_PASSWORD, database=SQL_DB, tds_version='8.0', port='1433', charset="UTF-8")
print( 'connected')
dbCur = dbCon.cursor()
conn = dbCon
cursor = dbCur
print( 'continuing')
except Exception as e:
print( 'Failed to reconnect')
print( str(e))
pass
pass
####################################################################################
#### FUNCTION TO EXCHANGE USERNAME+PASSWORD OR REFRESH TOKEN FOR BEARER TOKEN ####
####################################################################################
def getBearerToken(tenant):
if(REFRESH_TOKEN != ''):
try:
url = 'https://auth.seismic.com/tenants/' + tenant + '/connect/token'
data = {'client_id': CLIENT_ID,
'client_secret' : CLIENT_SECRET,
'grant_type' : 'refresh_token',
'refresh_token' : REFRESH_TOKEN
}
response = urlopen(url, urlencode(data).encode("utf-8"))
d = json.loads(response.read().decode('utf-8'))
return 'Bearer ' + d['access_token']
except HTTPError as e:
print(str(e))
print(e.read())
quit()
except Exception as e:
print('Unable to get bearer token from tenant: [' + tenant + '] using refresh token flow')
print(str(e))
quit()
else:
try:
url = 'https://auth.seismic.com/tenants/' + tenant + '/connect/token'
data = {'client_id': CLIENT_ID,
'client_secret' : CLIENT_SECRET,
'grant_type' : 'password',
'scope' : 'seismic.reporting',
'username' : TENANT_USERNAME,
'password' : TENANT_PASSWORD
}
response = urlopen(url, urlencode(data).encode("utf-8"))
d = json.loads(response.read().decode('utf-8'))
return 'Bearer ' + d['access_token']
except HTTPError as e:
print(str(e))
print(e.read())
quit()
except Exception as e:
print('Unable to get bearer token from tenant: [' + tenant + '] using password flow')
print(str(e))
quit()
def parse_args():
parser = argparse.ArgumentParser(
description='Get data from the Seismic reporting APIs into JSON, CSV and/or SQL. An example to get all data into both SQL and JSON would be getSeismicData.py --all --sql --json',
)
parser.add_argument(
'-r',
'--reports',
help='A comma separated list of reports such as ' + ','.join(reportList[0:2]) + '. Note, these are case sensitive',
type=str
)
parser.add_argument(
'-s',
'--startdate',
help='The start date for the report such as 2017-01-01',
default='',
type=str
)
parser.add_argument(
'-e',
'--enddate',
help='The end date for the report such as 2019-01-01',
default='',
type=str
)
parser.add_argument(
'--prefix',
help='Optional prefix to add to JSON filename or SQL table names',
default='',
type=str
)
parser.add_argument(
'--all',
action='store_true',
help='Runs all reports to individual JSON or SQL tables'
)
parser.add_argument(
'--sql',
action='store_true',
help='Output data to SQL'
)
parser.add_argument(
'--json',
action='store_true',
help='Output data to JSON'
)
parser.add_argument(
'--csv',
action='store_true',
        help='Output data to CSV'
)
args = parser.parse_args()
if(args.sql == False and args.json == False and args.csv == False):
print( 'Please specify either --sql or --json or --csv as output formats')
quit()
if((args.sql == True or args.json == True) and args.csv == True):
print('You cannot export to CSV at the same time as JSON or SQL')
quit()
    return args
def cleanStr(s):
if s is None:
return ''
elif type(s) == bool:
return str(s)
elif type(s) == list:
if(len(s) > 0):
return '|' + cleanStr('|'.join(s)) + '|'
else:
return 'null'
    return unicode(s).replace(u"\u2018", "'").replace(u"\u00e2\u20ac\u2122", "'").replace(u"\u2019", "'").replace("'", "''").replace("&", "\\&").encode('ascii', 'ignore').decode('ascii').strip()
def pythonToSqlType(t):
    # Map Python type names to SQL Server column types
    # ('str' appears on Python 3, 'unicode' on Python 2)
    return {
        'unicode': 'nvarchar(max)',
        'str': 'nvarchar(max)',
        'int': 'bigint',
        'float': 'float',
        'list': 'nvarchar(max)',
        'bool': 'bit',
        'NoneType': 'nvarchar(max)',
        'datetime': 'datetime'
    }[t]
def module_exists(module_name):
try:
__import__(module_name)
except ImportError:
return False
else:
return True
if __name__ == "__main__":
main()