API for Cloud Tasks
The cloud task API supports cloud scraping operations such as starting and stopping tasks and pulling scraped data.
Compared with local scraping, cloud scraping is fully hosted, requires no computer of your own, offers high performance and fast scraping, and integrates easily with enterprise or individual workflows.
- Concept description
Concept | Description |
---|---|
Task | A configured collection of scraping steps and rules. Different target websites are usually configured as different tasks. For example, scraping Amazon product category data can be one task, while scraping Amazon product lists and details can be another. The keyword in interfaces and fields is task. |
Batch | A record of one run of a task. A new batch is generated each time the task is started. The keyword in interfaces and fields is batch. |
Job | When a task starts, it can be divided into multiple jobs that run concurrently, which improves scraping efficiency and shortens scraping time. Currently, only a Loop step whose Loop Type is Loop List and whose List Type is URL List or Text List can be divided. In addition, to keep task configuration simple, only the outermost Loop step can be divided; loops nested inside it cannot. The keyword in interfaces and fields is job. |
Step | A logical unit in a task configuration, such as Navigate, Input Text, or Loop. The keyword in interfaces and fields is action. |
Agent | An agent that runs the scraping task on a cloud server. The keyword in interfaces and fields is agent. |
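To make the Job concept concrete, the sketch below divides a URL list into a requested number of jobs. The round-robin strategy and the `split_into_jobs` helper are illustrative assumptions only; this section does not document how the service actually partitions a list.

```python
def split_into_jobs(items, job_count):
    """Distribute items round-robin across at most job_count jobs.

    Illustrative only: the service's real partitioning is not documented here.
    """
    if job_count < 1:
        raise ValueError("job_count must be at least 1")
    jobs = [items[i::job_count] for i in range(job_count)]
    # Drop empty jobs when there are fewer items than requested jobs.
    return [job for job in jobs if job]


urls = [
    "https://www.coolparse.com/loop1",
    "https://www.coolparse.com/loop2",
    "https://www.coolparse.com/loop3",
    "https://www.coolparse.com/loop4",
]
jobs = split_into_jobs(urls, 2)
```

Each job could then be handled by a separate agent, which is the reason a larger job count can shorten the total run time.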
POST Start Task
Start a task in the cloud.
POST /cloud/start
Body Parameters
{
"taskID": "9b57f4b8-c5d5-44ac-b4be-52ea63ae4742",
"region": "us",
"nodeCount": 1,
"jobCount": 4,
"actionParams": "[{\"id\": \"0d2904db-b128-4862-ad46-708244b2c753\",\"value\": \"https://www.coolparse.com\"},{\"id\": \"87f25dc4-5d5f-47dc-a378-f5bc9f3a7fbc\",\"value\": \"some text\"},{\"id\": \"a7ce70e2-5932-4b07-a910-b0bba80ea2c6\",\"value\": 22},{\"id\": \"fb4ead12-99bc-438b-9c81-29bb547c9a33\",\"value\": [\"https://www.coolparse.com/loop1\",\"https://www.coolparse.com/loop2\",\"https://www.coolparse.com/loop3\",\"https://www.coolparse.com/loop4\"]}]"
}
Params
Name | Location | Type | Required | Description |
---|---|---|---|---|
Authorization | header | string | yes | none |
x-cp-request-id | header | string | yes | none |
body | body | object | no | none |
» taskID | body | string | yes | Task ID, copied from Task Overview. |
» region | body | string | no | Region in which the cloud agent runs. |
» nodeCount | body | integer | yes | Number of concurrent agents. |
» jobCount | body | integer | yes | Number of jobs to split the task into. |
» actionParams | body | string | no | Parameters for replacing the value of action. |
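As a rough sketch of how these parameters fit together, the snippet below builds (but does not send) a Start Task request with Python's standard library. The `BASE_URL` placeholder, the `make_start_request` helper, the token value, and the choice of a fresh UUID for `x-cp-request-id` are all assumptions; this section does not specify the host, the Authorization scheme, or the request ID format.

```python
import json
import urllib.request
import uuid

BASE_URL = "https://api.example.invalid"  # placeholder host, not the real one


def make_start_request(token, body):
    """Build a POST /cloud/start request object without sending it."""
    headers = {
        "Authorization": token,  # passed as-is; the scheme is not documented here
        "x-cp-request-id": str(uuid.uuid4()),  # assumed: a unique ID per request
        "Content-Type": "application/json",
    }
    data = json.dumps(body).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + "/cloud/start", data=data, headers=headers, method="POST"
    )


body = {
    "taskID": "9b57f4b8-c5d5-44ac-b4be-52ea63ae4742",
    "region": "us",
    "nodeCount": 1,
    "jobCount": 4,
}
req = make_start_request("YOUR_TOKEN", body)
```

Passing `req` to `urllib.request.urlopen` would send it once `BASE_URL` and the token are replaced with real values.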
Description
» region: Region in which the cloud agent runs.
An empty value selects the default region. Available regions:
- us United States
- gb Great Britain
- hk Hong Kong (China)
- jp Japan
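A client could check the region code before sending a request. The sketch below is one possible approach; the `normalize_region` helper and its lowercasing behavior are assumptions made for convenience, not part of the API.

```python
# Region codes taken from the list above.
REGIONS = {
    "us": "United States",
    "gb": "Great Britain",
    "hk": "Hong Kong (China)",
    "jp": "Japan",
}


def normalize_region(region):
    """Return a listed region code, or "" to request the default region."""
    code = (region or "").strip().lower()
    if code and code not in REGIONS:
        raise ValueError(f"unknown region {region!r}; expected one of {sorted(REGIONS)}")
    return code
```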
» actionParams: Parameters for replacing the value of action.
- Available steps and default field list
Action | Field Name | Field Path | Value Type | Description |
---|---|---|---|---|
Navigate | URL | config.url | string | Useful for passing a different URL each time the task runs. |
Input | Input Content | config.text | string | Useful for passing different search keywords or other text each time the task runs. |
Loop | Loop times | config.times | int | Available when Loop Type is 'Loop by Count'. |
Loop | URL List | config.urls | string array | Available when Loop Type is 'Loop List', List Type is 'URL List', and URL Source is 'Manual Input'. |
Loop | Text List | config.texts | string array | Available when Loop Type is 'Loop List' and List Type is 'Text List'. |
Loop | Custom | config.times | int | Available when Loop Type is 'Loop by page' and the 'Page times' mode is 'Custom'. |
Loop | Custom | config.times | int | Available when Loop Type is 'Loop by scroll' and the 'Scroll times' mode is 'Custom'. |
- Data structure of parameter object
Encode the JSON array as a string before passing it in the actionParams parameter.
Field | Description |
---|---|
id | ID of the action. Copy it with Click to copy step ID in Task Editor -> Edit Action. |
path | JSON path of the field to replace within the action object. An empty value means the default field. |
value | The value that replaces the action's field. |
Example:
[
{
"id": "0d2904db-b128-4862-ad46-708244b2c753",
"value": "https://www.coolparse.com"
},
{
"id": "87f25dc4-5d5f-47dc-a378-f5bc9f3a7fbc",
"value": "some text"
},
{
"id": "a7ce70e2-5932-4b07-a910-b0bba80ea2c6",
"value": 22
},
{
"id": "fb4ead12-99bc-438b-9c81-29bb547c9a33",
"value": [
"https://www.coolparse.com/loop1",
"https://www.coolparse.com/loop2",
"https://www.coolparse.com/loop3",
"https://www.coolparse.com/loop4"
]
}
]
Encoded as a string:
[{\"id\": \"0d2904db-b128-4862-ad46-708244b2c753\",\"value\": \"https://www.coolparse.com\"},{\"id\": \"87f25dc4-5d5f-47dc-a378-f5bc9f3a7fbc\",\"value\": \"some text\"},{\"id\": \"a7ce70e2-5932-4b07-a910-b0bba80ea2c6\",\"value\": 22},{\"id\": \"fb4ead12-99bc-438b-9c81-29bb547c9a33\",\"value\": [\"https://www.coolparse.com/loop1\",\"https://www.coolparse.com/loop2\",\"https://www.coolparse.com/loop3\",\"https://www.coolparse.com/loop4\"]}]
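In practice the encoding step is usually done with a JSON library rather than by hand. The sketch below builds parameter objects, optionally with an explicit `path`, and serializes the array into the string that `actionParams` expects. The helper names are hypothetical, and pairing the second step ID with `config.times` assumes that step is a Loop by Count step.

```python
import json


def action_param(step_id, value, path=""):
    """Build one parameter object; an empty path targets the default field."""
    param = {"id": step_id, "value": value}
    if path:
        param["path"] = path
    return param


def encode_action_params(params):
    """Serialize the parameter array to the string form used by actionParams."""
    return json.dumps(params)


params = [
    action_param("0d2904db-b128-4862-ad46-708244b2c753", "https://www.coolparse.com"),
    # Assumption: this step is a Loop by Count step, so config.times applies.
    action_param("a7ce70e2-5932-4b07-a910-b0bba80ea2c6", 22, path="config.times"),
]
body = {
    "taskID": "9b57f4b8-c5d5-44ac-b4be-52ea63ae4742",
    "nodeCount": 1,
    "jobCount": 4,
    "actionParams": encode_action_params(params),
}
request_text = json.dumps(body)  # the inner quotes are escaped automatically
```

Serializing the body a second time is what produces the backslash-escaped quotes seen in the encoded example.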
Response Examples
{
"code": 0,
"data": {
"batchID": 56693373452336,
"taskID": "9b57f4b8-c5d5-44ac-b4be-52ea63ae4742",
"createTime": 1731751954753,
"updateTime": 1731751954753,
"startTime": 0,
"endTime": 0,
"useTime": 0,
"lineCount": 0,
"duplicatedCount": 0,
"captchaCount": 0,
"ipProxyUsage": 0,
"region": "us",
"nodeCount": 1,
"jobCount": 3,
"status": 1,
"fields": [
{
"id": "9a117f51-1451-4b4b-9d0f-1ec2e33c63ff",
"name": "Title"
},
{
"id": "1a1978dc-c9ce-4ec7-b64b-19b28a451e97",
"name": "Link"
},
{
"id": "a47a81b0-50df-4a33-ba78-efcbf34b7c0c",
"name": "Description"
}
]
}
}
Responses
HTTP Status Code | Meaning | Description | Data schema |
---|---|---|---|
200 | OK | none | Inline |
Responses Data Schema
HTTP Status Code 200
Name | Type | Required | Restrictions | Title | description |
---|---|---|---|---|---|
» code | integer | true | none | none | |
» data | object | true | none | Info of the running batch. | |
»» batchID | integer | true | none | none | |
»» taskID | string | true | none | none | |
»» createTime | integer | true | none | none | |
»» updateTime | integer | true | none | none | |
»» startTime | integer | true | none | none | |
»» endTime | integer | true | none | none | |
»» useTime | integer | true | none | none | |
»» lineCount | integer | true | none | Number of scraped data rows. | |
»» duplicatedCount | integer | true | none | Number of duplicated data rows. | |
»» captchaCount | integer | true | none | Usage of captcha credits. | |
»» ipProxyUsage | integer | true | none | Usage of IP proxy credits. | |
»» region | string | true | none | none | |
»» nodeCount | integer | true | none | none | |
»» jobCount | integer | true | none | none | |
»» status | integer | true | none | none | |
»» fields | [object] | true | none | Fields configured in the Extract Data action. | |
»»» id | string | true | none | none | |
»»» name | string | true | none | none |
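A client might read this response as sketched below. Two assumptions are made that this section does not confirm: a `code` of 0 indicates success (the examples only ever show 0), and the time fields are Unix epoch milliseconds, with 0 meaning "not set yet". The helper names are hypothetical.

```python
import json
from datetime import datetime, timezone


def parse_batch(response_text):
    """Return the batch info from a response, assuming code 0 means success."""
    payload = json.loads(response_text)
    if payload.get("code") != 0:
        raise RuntimeError(f"API returned code {payload.get('code')}")
    return payload["data"]


def to_datetime(ms):
    """Convert assumed epoch milliseconds to a UTC datetime; 0 becomes None."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc) if ms else None


sample = (
    '{"code": 0, "data": {"batchID": 56693373452336, '
    '"createTime": 1731751954753, "startTime": 0, "status": 1, '
    '"fields": [{"id": "9a117f51-1451-4b4b-9d0f-1ec2e33c63ff", "name": "Title"}]}}'
)
batch = parse_batch(sample)
created = to_datetime(batch["createTime"])
started = to_datetime(batch["startTime"])
field_names = [field["name"] for field in batch["fields"]]
```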
POST Stop Task
Stop the running task in the cloud.
POST /cloud/stop
Body Parameters
{
"taskID": "9b57f4b8-c5d5-44ac-b4be-52ea63ae4742"
}
Params
Name | Location | Type | Required | Description |
---|---|---|---|---|
Authorization | header | string | yes | none |
x-cp-request-id | header | string | yes | none |
body | body | object | no | none |
» taskID | body | string | yes | Task ID, copied from Task Overview. |
Response Examples
{
"code": 0,
"data": {
"batchID": 56629190950176,
"taskID": "9b57f4b8-c5d5-44ac-b4be-52ea63ae4742",
"createTime": 1731250529000,
"updateTime": 1731250529000,
"startTime": 0,
"endTime": 1731751949219,
"useTime": 9223372036854,
"lineCount": 0,
"duplicatedCount": 0,
"captchaCount": 0,
"ipProxyUsage": 0,
"region": "us",
"nodeCount": 1,
"jobCount": 3,
"status": 4,
"fields": [
{
"id": "9a117f51-1451-4b4b-9d0f-1ec2e33c63ff",
"name": "Title"
},
{
"id": "1a1978dc-c9ce-4ec7-b64b-19b28a451e97",
"name": "Link"
},
{
"id": "a47a81b0-50df-4a33-ba78-efcbf34b7c0c",
"name": "Description"
}
]
}
}
Responses
HTTP Status Code | Meaning | Description | Data schema |
---|---|---|---|
200 | OK | none | Inline |
Responses Data Schema
HTTP Status Code 200
Name | Type | Required | Restrictions | Title | description |
---|---|---|---|---|---|
» code | integer | true | none | none | |
» data | object | true | none | Info of the stopped batch. | |
»» batchID | integer | true | none | none | |
»» taskID | string | true | none | none | |
»» createTime | integer | true | none | none | |
»» updateTime | integer | true | none | none | |
»» startTime | integer | true | none | none | |
»» endTime | integer | true | none | none | |
»» useTime | integer | true | none | none | |
»» lineCount | integer | true | none | none | |
»» duplicatedCount | integer | true | none | none | |
»» captchaCount | integer | true | none | none | |
»» ipProxyUsage | integer | true | none | none | |
»» region | string | true | none | none | |
»» nodeCount | integer | true | none | none | |
»» jobCount | integer | true | none | none | |
»» status | integer | true | none | none | |
»» fields | [object] | true | none | none | |
»»» id | string | true | none | none | |
»»» name | string | true | none | none |
GET Get Batches
Get the batches created by running cloud tasks.
GET /cloud/batches
Params
Name | Location | Type | Required | Description |
---|---|---|---|---|
page | query | string | no | Page index, starting from 1. |
size | query | string | no | Page size. |
taskID | query | string | no | Task ID, copied from Task Overview. |
orderBy | query | string | yes | Sort order: 0 = reverse order of creation time, 1 = reverse order of running time, 2 = reverse order of scraped data count. |
Authorization | header | string | yes | none |
x-cp-request-id | header | string | yes | none |
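The query parameters above can be assembled with `urllib.parse.urlencode`. This is a minimal sketch; the `batches_query` helper is hypothetical, and the host and headers are left out.

```python
from urllib.parse import urlencode


def batches_query(order_by, page=None, size=None, task_id=None):
    """Build the path and query string for GET /cloud/batches."""
    query = {"orderBy": order_by}  # the only required query parameter
    if page is not None:
        query["page"] = page
    if size is not None:
        query["size"] = size
    if task_id:
        query["taskID"] = task_id
    return "/cloud/batches?" + urlencode(query)


# Newest batches first (orderBy=0), first page of 20.
path = batches_query(0, page=1, size=20)
```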
Response Examples
200 Response
{
"code": 0,
"data": {
"list": [
{
"batchID": 0,
"taskID": "string",
"createTime": 0,
"updateTime": 0,
"startTime": 0,
"endTime": 0,
"useTime": 0,
"lineCount": 0,
"duplicatedCount": 0,
"captchaCount": 0,
"ipProxyUsage": 0,
"region": "string",
"nodeCount": 0,
"jobCount": 0,
"status": 0,
"fields": [
{
"id": null,
"name": null
}
]
}
],
"pagination": {
"page": 0,
"size": 0,
"total": 0
}
}
}
Responses
HTTP Status Code | Meaning | Description | Data schema |
---|---|---|---|
200 | OK | none | Inline |
Responses Data Schema
HTTP Status Code 200
Name | Type | Required | Restrictions | Title | description |
---|---|---|---|---|---|
» code | integer | true | none | none | |
» data | object | true | none | none | |
»» list | [object] | true | none | List of batch info. | |
»»» batchID | integer | false | none | none | |
»»» taskID | string | false | none | none | |
»»» createTime | integer | false | none | none | |
»»» updateTime | integer | false | none | none | |
»»» startTime | integer | false | none | none | |
»»» endTime | integer | false | none | none | |
»»» useTime | integer | false | none | none | |
»»» lineCount | integer | false | none | Number of scraped data rows. | |
»»» duplicatedCount | integer | false | none | Number of duplicated data rows. | |
»»» captchaCount | integer | false | none | Usage of captcha credits. | |
»»» ipProxyUsage | integer | false | none | Usage of IP proxy credits. | |
»»» region | string | false | none | none | |
»»» nodeCount | integer | false | none | none | |
»»» jobCount | integer | false | none | none | |
»»» status | integer | false | none | none | |
»»» fields | [object] | false | none | Fields configured in the Extract Data action. | |
»»»» id | string | true | none | none | |
»»»» name | string | true | none | none | |
»» pagination | object | true | none | none | |
»»» page | integer | true | none | none | |
»»» size | integer | true | none | none | |
»»» total | integer | true | none | none |
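The pagination object makes it possible to walk through every page. The sketch below assumes `total` is the total number of items rather than the number of pages, which this section does not state. `iter_batches` and `fake_fetch` are hypothetical names; `fake_fetch` stands in for a real HTTP call so the loop can be run offline.

```python
def iter_batches(fetch_page, size=20):
    """Yield every batch, requesting pages until `total` items have been covered."""
    page = 1
    while True:
        data = fetch_page(page, size)["data"]
        yield from data["list"]
        total = data["pagination"]["total"]  # assumed: total item count
        if page * size >= total or not data["list"]:
            break
        page += 1


# Stand-in for GET /cloud/batches, returning the documented response shape.
ALL_BATCHES = [{"batchID": n} for n in (101, 102, 103)]


def fake_fetch(page, size):
    start = (page - 1) * size
    return {
        "code": 0,
        "data": {
            "list": ALL_BATCHES[start:start + size],
            "pagination": {"page": page, "size": size, "total": len(ALL_BATCHES)},
        },
    }


batch_ids = [batch["batchID"] for batch in iter_batches(fake_fetch, size=2)]
```

The empty-list check is a safeguard so the loop still terminates if `total` turns out to mean something else.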
GET Get Data
Get data scraped by a cloud task. You can get the batchID from the Get Batches API.
GET /cloud/data
Params
Name | Location | Type | Required | Description |
---|---|---|---|---|
page | query | string | no | none |
size | query | string | no | none |
orderBy | query | string | yes | none |
batchID | query | string | no | CoolParse creates a batch for every Run Task action. |
preview | query | boolean | no | "true" returns only a limited length of each field; otherwise the full data is returned. |
deduplication | query | boolean | no | "true" skips duplicate rows that were already returned by previous requests. Rows whose field values are all the same are considered duplicates. |
deduplicationID | query | string | no | Key that identifies the deduplication scope across subsequent requests in the same task. Expires 1 hour after the last request. |
Authorization | header | string | yes | none |
x-cp-request-id | header | string | yes | none |
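One way to use the deduplication parameters is to reuse the same key across successive polls, as sketched below. This assumes `deduplicationID` is a string chosen by the client, which this section does not state explicitly. The `data_query` helper and the `orderBy` value of 0 are also illustrative, since the accepted `orderBy` values for this endpoint are not listed here.

```python
from urllib.parse import urlencode


def data_query(order_by, batch_id=None, page=None, size=None, preview=False, dedup_id=None):
    """Build the path and query string for GET /cloud/data."""
    query = {"orderBy": order_by}
    if batch_id is not None:
        query["batchID"] = batch_id
    if page is not None:
        query["page"] = page
    if size is not None:
        query["size"] = size
    if preview:
        query["preview"] = "true"
    if dedup_id:
        # Reusing the same key on later requests asks the service to skip
        # rows it has already returned under that key.
        query["deduplication"] = "true"
        query["deduplicationID"] = dedup_id
    return "/cloud/data?" + urlencode(query)


first_poll = data_query(0, batch_id=56693373452336, page=1, size=10, dedup_id="nightly-export")
```

Because the key expires 1 hour after the last request, polls spaced further apart than that would start a fresh deduplication scope.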
Response Examples
{
"code": 0,
"data": {
"list": [
"{\"f19c720c-a430-49e0-b8ca-17f67da485bb\":\"data item 1\"}",
"{\"f19c720c-a430-49e0-b8ca-17f67da485bb\":\"data item 2\"}",
"{\"f19c720c-a430-49e0-b8ca-17f67da485bb\":\"data item 3\"}"
],
"pagination": {
"page": 1,
"size": 10,
"total": 3
}
}
}
Responses
HTTP Status Code | Meaning | Description | Data schema |
---|---|---|---|
200 | OK | none | Inline |
Responses Data Schema
HTTP Status Code 200
Name | Type | Required | Restrictions | Title | description |
---|---|---|---|---|---|
» code | integer | true | none | none | |
» data | object | true | none | none | |
»» list | [string] | true | none | none | |
»» pagination | object | true | none | none | |
»»» page | integer | true | none | none | |
»»» size | integer | true | none | none | |
»»» total | integer | true | none | none |
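Each entry in `list` is itself a JSON-encoded string, so it needs a second decoding step. The sketch below also renames keys to readable field names, assuming the keys are the field IDs reported in a batch's `fields` array; the `decode_rows` helper and the "Title" name for this ID are hypothetical.

```python
import json


def decode_rows(data_list, fields):
    """Decode each JSON-string row and map field IDs to field names."""
    names = {field["id"]: field["name"] for field in fields}
    rows = []
    for item in data_list:
        raw = json.loads(item)
        # Keys without a known name are kept as-is.
        rows.append({names.get(key, key): value for key, value in raw.items()})
    return rows


data_list = [
    "{\"f19c720c-a430-49e0-b8ca-17f67da485bb\":\"data item 1\"}",
    "{\"f19c720c-a430-49e0-b8ca-17f67da485bb\":\"data item 2\"}",
]
fields = [{"id": "f19c720c-a430-49e0-b8ca-17f67da485bb", "name": "Title"}]
rows = decode_rows(data_list, fields)
```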