historical-powertrack
historical-powertrack

Historical PowerTrack

Methods

Method Description
POST /jobs Create a new Historical PowerTrack job
GET /jobs/:uuid Monitor the status of a single Historical PowerTrack job
PUT /jobs/:uuid Accept or reject a Historical PowerTrack job that has been estimated
GET /results Retrieve the list of URLs for a finished job. These URLs can then be used to download the data files.
Download Twitter Data Files Download the Twitter data files generated for a complete Historical PowerTrack job.
GET /jobs Check the status of all your Historical PowerTrack jobs

Support for Tweet edits - Since Historical PowerTrack only delivers Tweets from at least 30 minutes ago, all Tweets returned will reflect their final edit state. All Tweet objects for Tweets posted starting August 17, 2022, will include metadata that describes its edit history. See the "Tweet edits" fundamentals page for more details.

Authentication

All requests to the Historical PowerTrack API must use HTTP Basic Authentication, constructed from a valid email address and password combination used to log into your account at console.gnip.com. Credentials must be passed as the Authorization header for each request. Make sure your client is adding the "Authentication: Basic" HTTP header (with encoded credentials over HTTPS) to all API requests.

Job Stages

Below are the stages that a job may have as it goes through the life cycle. Job status messages returned by the API may vary from the table below.

Status Status Message(s)
opened Waiting on quote from Gnip.
estimating
  • Generating a job quote.
  • Request to auto-estimate queued.
  • Auto-estimate job is running.
  • Auto-estimate job has run, time to do some math.
quoted Job quoted and awaiting customer acceptance.
accepted Job accepted and ready to be queued.
rejected Job quote rejected by customer.
running
  • Job is being processed.
  • Job completed and being validated prior to customer delivery.
  • Job paused.
completed Job completed and awaiting validation.
delivered Job delivered and available for download
failed
  • Job completed but failed validation. Twitter investigating.
  • Failure running job. Twitter investigating.
  • Auto-estimate job has failed. Twitter investigating.

Example Job Timeline:

2017-03-07 19:42 UTC Job Created 2017-03-07 19:50 UTC Estimation Completed 2017-03-07 19:51 UTC Quoted 2017-03-07 19:52 UTC Accepted 2017-03-07 19:53 UTC Started Running 2017-03-07 21:21 UTC Completed Running 2017-03-22 21:21 UTC Expires

Important Note: "Failed" jobs should not be resubmitted. The Twitter Data team will fix this. If you resubmit a failed job, you'll end up with two jobs and pay for the data twice. Assume that a failed job has the team's attention and that Twitter will fix it.

POST /jobs 

Creates a new Historical PowerTrack job.

Request Specifications

HTTP Method POST
URL https://gnip-api.gnip.com/historical/powertrack/accounts/{ACCOUNT_NAME}/publishers/twitter/jobs.json
Content-Type "application/json"
Note: This should be specified in the header of the request.
Request Body Format JSON
Request Body Size Limit 1000 rules
Rate Limits A maximum of 60 Jobs can be created per (UTC) day.
A maximum of 30 Jobs can be created per hour.
A maximum of 2 Jobs can be estimating concurrently.
A maximum of 2 Jobs can be running concurrently.

Request Body Elements

Parameter Example Comment
publisher "publisher":"twitter" The data publisher you want the historical job to use. Currently Twitter only.
dataFormat "dataFormat":"original" The data format to use for the job. Enter "original" to have your data delivered in the enriched native format.
fromDate "fromDate":"201201010000" A timestamp indicating the start time of the period of interest. Timestamp is specified in the UTC time zone and has a 'YYYYMMDDHHMM' format (minute granularity). This date is inclusive, meaning the minute specified will be included in the job.
toDate "toDate":"201201020000" A timestamp indicating the end time of the period of interest. Timestamp is specified in the UTC time zone and has a 'YYYYMMDDHHMM' format (minute granularity). This is NOT inclusive, so a time of 00:00 will return data through 23:59 of previous day. I.e. the specified minute will not be included in the job, but the minute immediately preceding it will In the UTC time zone.
title "title":"my_job123" A title for the historical job. This must be unique and jobs with duplicate titles will be rejected.
rules "rules":[ { "value":"cow" }, { "value":"dog", "tag":"pets" }] The rules which will determine what data is returned by your job. Historical PowerTrack jobs support up to 1000 PowerTrack rules. For information on PowerTrack rules, see our documentation here, and be sure to consider the caveats listed here.

Specifying the Correct Time Window

To ensure that your Historical PowerTrack job retrieves all the data you need, you should ensure that your app correctly applies the fromDate and toDate parameters. A common customer error is to set the toDate one minute earlier than it should be, resulting in the job failing to retrieve the last minute of data desired. This is because toDate is exclusive, where fromDate is inclusive.

To further illustrate, consider the following example, in which the customer desires to retrieve data from minutes 55, 56, 57, 58, and 59 of a given hour. The customer must use 06:00 as the toDate to ensure that they capture Tweets from minute 59. Tweets from 06:00 are not included in the Twitter data that is delivered.

To extend this to a more practical example, imagine a Historical PowerTrack job that required data for every minute of January 1, 2014. To capture each of these minutes, the job should have a fromDate of 201401010000 (01-01-2014 00:00), and a toDate of 201401020000 (01-02-2014 00:00 -- minute 0 of January 2). This will ensure minutes 00:00-23:59 are delivered in the results.

Understanding Number of "Historical Powertrack Days"

A "Historical Powertrack Day" starts at UTC 0000 and ends at 2359. Inclusion of any minute of a "Historical Powertrack Day" in a job would count as a day.

A job with "fromDate":"201201010000" and "toDate":"201201020000" would count as 1 day.
A job with "fromDate":"201201010001" and "toDate":"201201020001" would count as 2 days.
A job with "fromDate":"201201010000" and "toDate":"201201020001" would count as 2 days.

Request Body Example

{
    "publisher":"Twitter",
    "dataFormat":"original",
    "fromDate":"201201010000",
    "toDate":"201201010001",
    "title":"my_job123",
    "rules":
        [
            {
                "value":"cat"
            }
        ]
}

Example cURL Request

curl -v POST -u [email protected]
    'https://gnip-api.gnip.com/historical/powertrack/accounts/{ACCOUNT_NAME}/publishers/twitter/jobs.json'
    -H 'Content-type: application/json'
    -d '{"publisher":"twitter","dataFormat":"original","fromDate":"201609212359",
    "toDate":"201609230000","title":"testjob12345","rules":[{"tag":"tag_name","value":"rule"}]}'

Response Example

{
    "title":"my_job",
    "account":"gnip-customer",
    "publisher":"Twitter",
    "streamType":"track_v2",
    "format":"original",
    "fromDate":"201201010000",
    "toDate":"201201010001",
    "requestedBy":"[email protected]",
    "requestedAt":"2012-06-13T22:43:27Z",
    "status": "open",
    "statusMessage": "Waiting on quote from Gnip.",
    "jobURL": "https://gnip-api.gnip.com/historical/powertrack/accounts/{ACCOUNT_NAME}/publishers/twitter/jobs/{JOB_UUID}.json"
}

Common Errors:

    {
    "status": "error",
    "reason": "Accepting this job will cause you to exceed your monthly activities cap"
    }

GET /jobs/uuid 

Monitors the status of a historical job.

After a job is created, you can use this request to monitor the current status of the specific job.

When the job is in the process of generating an estimate of expected order of magnitude of activity volume and time required, this request provides insight into the progress in the estimating process. Once this estimate is complete, the response will indicate the volume and time estimates referenced.

After the job has been accepted, the response can be used to track its progress as the data files are generated.

Request Specifications

HTTP Method GET
URL https://gnip-api.gnip.com/historical/powertrack/accounts/{ACCOUNT_NAME}/publishers/twitter/jobs/{JOB_UUID}.json
Rate Limit 5 requests per 5 seconds aggregated across all GET requests.

Estimate Fields

These fields are returned in the response for this request after a job has been estimated. They should be used to evaluate whether or not to accept the job and run it to completion.

Field Description
costDollars The cost of the job. Note that "costDollars" will not be present if you have a subscription to Historical PowerTrack, as the cost is already included in your subscription.
expiresAt Indicates how long this job will be available for you to either accept or reject it.
estimatedActivityCount An order-of-magnitude estimate of the number of results (tweets) this job is expected to return if accepted. The minimum value is 100 activities. If the sample used to generate the estimate returns no matches, this value will be 100 activities.
estimatedDurationHours An estimate of the amount of time that will be required to run the full job, if accepted.
estimatedFileSizeMb An estimate of the total compressed file size for the entire job after it is completed.

Example cURL Request

curl -v GET -u [email protected] 
    'https://gnip-api.gnip.com/historical/powertrack/accounts/{ACCOUNT_NAME}/publishers/twitter/jobs/{JOB_UUID}.json'

Response Examples

After the Estimate is Completed:

{
    "title":"my_job",
    "account":"gnip-customer",
    "publisher":"Twitter",
    "streamType":"track_v2",
    "format":"original",
    "fromDate":"201201010000",
    "toDate":"201201010001",
    "requestedBy":"[email protected]",
    "requestedAt":"2012-06-13T22:43:27Z",
    "status": "quoted",
    "statusMessage": "Job quoted and awaiting customer approval.",
    "jobURL":"https://gnip-api.gnip.com/historical/powertrack/accounts/{ACCOUNT_NAME}/publishers/twitter/jobs/{JOB_UUID}.json",
    "quote": {
        "costDollars": 5000,
        "estimatedActivityCount":10000,
        "estimatedDurationHours":72,
        "estimatedFileSizeMb": 10.0,
        "expiresAt": "2012-06-16T22:43:00Z"
    },
    "percentComplete": 0
}

While the Accepted Job is Running

{
    "title":"my_job",
    "account":"gnip-customer",
    "publisher":"Twitter",
    "streamType":"track_v2",
    "format":"original",
    "fromDate":"201201010000",
    "toDate":"201201010001",
    "requestedBy":"[email protected]",
    "requestedAt":"2012-06-13T22:43:27Z",
    "status": "delivered",
    "statusMessage": "Job delivered and available for download.",
    "jobURL":"https://gnip-api.gnip.com/historical/powertrack/accounts/{ACCOUNT_NAME}/publishers/twitter/jobs/{JOB_UUID}.json",
    "quote":
    {
        "costDollars": 5000,
        "estimatedActivityCount":10000,
        "estimatedDurationHours":72,
        "estimatedFileSizeMb": 10.0,
        "expiresAt": "2012-06-16T22:43:00Z"
    },
    "acceptedBy": "[email protected]",
    "acceptedAt": "2012-06-14T22:43:27Z",
    "percentComplete": 100,
    "results":
    {
        "completedAt":"2012-06 17T22:43:27Z",
        "activityCount":1200,
        "fileCount":1000,
        "fileSizeMb":10.0,
        "dataURL":"https://gnip-api.gnip.com/historical/powertrack/accounts/{ACCOUNT_NAME}/publishers/twitter/jobs/{JOB_UUID}.json"
        "expiresAt": "2012-06-30T22:43:00Z"
    }
}

PUT /jobs/uuid 

Accepts or rejects a historical job in the "quoted" stage.

IMPORTANT: Accepted jobs will be started by Twitter's system, and cannot be stopped after acceptance. Rejected jobs cannot be recovered.

Request Specifications

HTTP Method PUT
URL https://gnip-api.gnip.com/historical/powertrack/accounts/{ACCOUNT_NAME}/publishers/twitter/jobs/{JOB_UUID}.json
Content-Type "application/json"

Note: This should be specified in the header of the request.

Accept Request Body Example

{
  "status":"accept"
}

Reject Request Body Example

{
  "status":"reject"
}

Response

After a job is either accepted or rejected, the job will include an “acceptedAt” field, which indicates the time of that action. This field records the date regardless of whether the job was accepted or rejected, and should not be interpreted as an indicator of the type of action you took on it. The proper field for this type of info is the “status” field.

Example cURL Request

curl -v PUT -u [email protected] 
    'https://gnip-api.gnip.com/historical/powertrack/accounts/{ACCOUNT_NAME}/publishers/twitter/jobs/{JOB_UUID}.json' 
    -H 'Content-type: application/json' 
    -d '{"status":"accept"}'

Response Example

{
     "title":"my_job",
     "account":"gnip-customer",
     "publisher":"Twitter",
     "streamType":"track_v2",
     "format":"original",
     "fromDate":"201201010000",
     "toDate":"201201010001",
     "requestedBy":"[email protected]",
     "requestedAt":"2012-06-13T22:43:27Z",
     "status": "accepted",
     "statusMessage": "Job accepted and ready to be queued.",
     "jobURL":"https://gnip-api.gnip.com/historical/powertrack/accounts/{ACCOUNT_NAME}/publishers/twitter/jobs/{JOB_UUID}.json",
     "quote": {
           "costDollars": 5000,
           "estimatedActivityCount":10000,
           "estimatedDurationHours":72,
           "estimatedFileSizeMb": 10.0,
           "expiresAt": "2012-06-16T22:43:00Z"
      },
     "acceptedBy": "[email protected]",
     "acceptedAt": "2012-06-14T22:43:27Z",
     "percentComplete": 0
}

GET /results 

Retrieves information about a completed Historical PowerTrack job, including a list of URLs which correspond to the data files generated for a completed historical job. These URLs will be used to download the Twitter data files.

Note: Results must be downloaded within 15 days of job completion. Expired jobs are not recoverable.

Request Specifications

HTTP Method GET
URL Found in the dataURL field returned in the response from the GET /jobs/:uuid request for a completed job. The URL resembles the following structure:
https://gnip-api.gnip.com/historical/powertrack/accounts/{ACCOUNT_NAME}/publishers/twitter/jobs/{JOB_UUID}/results.json
Response Format JSON
Rate Limit 5 requests per 5 seconds aggregated across all GET requests.

Response Elements

For a completed job, the response for this request will include the following elements in the JSON payload.

Field Description
urlCount The number of data files from the job.
urlList A list of urls representing each data file from the job. Each file represents a 10-minute period of data during the job’s requested timeframe. If a there was no matching data for a 10-minute period, the file representing that period will be omitted from the list.

These URLs should be used when downloading the Twitter data for the job.
totalFileSizeBytes The total file size of the data files in bytes.
suspectMinutesURL Optional. There are minute periods in Gnip’s historical corpus where there might not be 100% data fidelity due to unexpected occurrences like Twitter downtime or Twitter connectivity/API issues. This suspectMinutesURL field will only be present in the payload if there are suspect minutes during the job’s specified timeframe. If this field is present, a GET request with your Gnip Console credentials will return a list of the suspect minutes within the timeframe.
expiresAt This is the time at which the results for this job will no longer be available for download.

Example cURL Request

curl -v GET -u [email protected] 
    'https://gnip-api.gnip.com/historical/powertrack/accounts/{ACCOUNT_NAME}/publishers/twitter/jobs/{JOB_UUID}/results.json'

Response Example

{
    "urlCount":5,
    "urlList":[
        "https://archive.replay.historical.review.s3.amazonaws.com/historical/twitter/track/original/gnip-customer/2012/08/29/20120701-20120701_jbmfgctzw5/2012/07/01/00/10_activities.json.gz?...",
        "https://archive.replay.historical.review.s3.amazonaws.com/historical/twitter/track/original/gnip-customer/2012/08/29/20120701-20120701_jbmfgctzw5/2012/07/01/00/10_activities.json.gz?...",
        "https://archive.replay.historical.review.s3.amazonaws.com/historical/twitter/track/original/gnip-customer/2012/08/29/20120701-20120701_jbmfgctzw5/2012/07/01/00/10_activities.json.gz?...",
        "https://archive.replay.historical.review.s3.amazonaws.com/historical/twitter/track/original/gnip-customer/2012/08/29/20120701-20120701_jbmfgctzw5/2012/07/01/00/10_activities.json.gz?...",
        "https://archive.replay.historical.review.s3.amazonaws.com/historical/twitter/track/original/gnip-customer/2012/08/29/20120701-20120701_jbmfgctzw5/2012/07/01/00/10_activities.json.gz?..."
    ],
    "totalFileSizeBytes":1146952,
    "suspectMinutesUrl":
    "https://archive.replay.historical.review.s3.amazonaws.com/customers/gnip-biz/20120101-20120101_5zzax1awxq/suspectMinutes.json?..."
    "expiresAt":"2012-11-24T18:53:23Z"
}

In addition to the JSON file, we also generate a CSV version that can be used where more convenient. The CSV file contains a series of lines separated by 'n', where each line is a local file name for the job file, the tab ('t') character, and the S3 url for the file.

The CSV can be used for the same purpose as the JSON version when downloading the historical job. In the "downloading" section below, example curl commands use the csv version, rather than the JSON version, but both types will deliver the same results. This list can also be downloaded by navigating to the “dataURL” in your browser while logged in to the Gnip console.

Note: All data generated by Historical PowerTrack are in JSON format. The CSV format referenced here refers only to the list of download links, not the actual Twitter data itself.

Download Twitter Data Files 

Downloads a Gzip compressed JSON file of Twitter data for a completed Historical PowerTrack job.

Historical PowerTrack jobs generate 1 file for each 10-minute segment of time covered by the job, except where there is no matching Twitter data found within that 10-minute segment. For example, a Historical PowerTrack job covering 1 hour would contain 6 separate files, which would each need to be downloaded via separate requests.

The response to the GET /results request will include a 'urlCount' field that indicates the total number of files to download, and the URLs corresponding to each file. The request described below must be repeated for each file URL listed in the 'results' response, although the requests can be made in parallel. Upon completion of all downloads, your app should ensure that the number of files downloaded matches the 'urlCount' from the 'results' response..

Historical data files are available for 15 days from when the job is accepted. After 15 days the data will no longer be accessible.

For details on the format of the Tweets, click here.

Request Specifications

HTTP Method GET
URL URLs for each file are provided in the response to the GET /results method for a completed job. The request must be repeated for each file URL listed there.
File Compression Gzip

Downloading Files

cURL is a simple command line tool for sending HTTP requests, and provides an easy way to download files for Historical PowerTrack jobs. cURL is native to Linux and Mac OS, and may be installed on Windows in combination with the CygWin package. See our article here for details on this.

For examples demonstrating how to send the types of HTTP requests described below, see the below links to code examples:

Download script (bash)

  • The bash script above can be used to download Historical PowerTrack data files

rbHistoricalPT (ruby)

Download files using cURL command (Option 1 of 2)

This command performs both the GET /results and GET /activities requests to download all the Twitter data files for a Historical PowerTrack job. Downloading of the files is performed in parallel into a local directory. As each file is downloaded, the command will print the command being used.

When using this command, be sure to use the CSV form of the results file (results.csv), rather than the JSON results file (results.json). The 'name' column of all_files.csv is used to name the individual files.

curl -sS -u [email protected]:password https://gnip-api.gnip.com/historical/powertrack/accounts/{ACCOUNT_NAME}/publishers/twitter/jobs/{JOB_UUID}/results.csv | xargs -P 8 -t -n2 curl -o

If you experience issues with the above command, it is often useful to try downloading just the first data file to help debug. Again, when using this command, be sure to use the CSV form of the results file (results.csv), rather than the JSON results file (results.json).

curl -sS -u [email protected]:password https://gnip-api.gnip.com/historical/powertrack/accounts/{ACCOUNT_NAME}/publishers/twitter/jobs/{JOB_UUID}/results.csv | head -1 | xargs -P 8 -t -n2 curl -o

Download files using bash script (option 2 of 2)

Another option for downloading Historical PowerTrack data files is by using a bash script. One advantage of using this script is that it has 'smarts' when restarting a download cycle. If your download cycle gets interrupted for any reason, the script will inspect what files it has already downloaded and download only the files that are not available locally.

To learn more about this bash script, please read our Downloading Historical PowerTrack Files documentation.

GET /jobs 

Retrieves details for all historical PowerTrack jobs which are not expired for the given account.

This request can be used to pull the list of all Historical PowerTrack jobs for the customer's account, as well as aggregate metrics related to Historical PowerTrack usage. Expired jobs are not included in the list of jobs returned. Jobs which are not accepted expire 15 days after being estimated. Jobs that are accepted will expire in 15 days after acceptance.

For aggregate usage data, see the delivered, jobCount, jobDaysRun, and activityCount fields.

Request Specifications

HTTP Method GET
URL https://gnip-api.gnip.com/historical/powertrack/accounts/{ACCOUNT_NAME}/publishers/twitter/jobs.json
Rate Limit 5 requests per 5 seconds aggregated across all GET requests.

Example cURL Request

curl -v GET -u [email protected] 
    'https://gnip-api.gnip.com/historical/powertrack/accounts/{ACCOUNT_NAME}/publishers/twitter/jobs.json'

Example Response

{
    "jobs":[{
        "title": "my_job1",
        "jobURL":"https://gnip-api.gnip.com/historical/powertrack/accounts/{ACCOUNT_NAME}/publishers/twitter/jobs/{JOB_UUID}.json",
        "status": "running",
        "publisher":"Twitter",
        "streamType":"track_v2",
        "fromDate":"201201010000",
        "toDate":"201201010001"
        "percentComplete": 40
        "expiresAt":"2012-11-14T21:21:15Z"
        },{
        "title": "my_job2",
        "jobURL":"https://gnip-api.gnip.com/historical/powertrack/accounts/{ACCOUNT_NAME}/publishers/twitter/jobs/{JOB_UUID}.json",
        "status": "delivered",
        "publisher":"Twitter",
        "streamType":"track_v2",
        "fromDate":"201201010000",
        "toDate":"201201010001",
        "percentComplete": 100
        "expiresAt":"2012-11-16T10:15:03Z"
        }],
        "delivered":{
        "jobCount":3,
        "jobDaysRun":73,
        "activityCount":15958
    }
}