StreamSets Data Collector Installation Using tarball (SDC)

StreamSets Data Collector is a low-latency ingest infrastructure tool that lets you create continuous data ingest pipelines using a drag-and-drop UI within an integrated development environment (IDE). In this post I will show you how to install StreamSets Data Collector from the tarball.

Thousands of companies use the open source StreamSets Data Collector to efficiently build, test, run and maintain data-flow pipelines connecting a variety of batch and streaming data sources and compute platforms. Data Collector pipelines require minimal schema specification and uniquely detect and handle data drift.

Environment

  • OS: Ubuntu
  • Package: StreamSets Data Collector 3.3.0
  • Disk Space: Minimum 5GB

Component          Minimum Requirement
Operating system   Mac OS X; CentOS 6 or 7; Red Hat Enterprise Linux 6 or 7; Ubuntu 14.04 LTS or 16.04 LTS
Cores              2
RAM                1 GB
Disk space         6 GB
File descriptors   32768
Java               Oracle Java 8 or OpenJDK 8
Browser            Latest version of Chrome, Firefox, or Safari
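
The requirements above can be checked before you start. The following is only a minimal sketch: `check_min` is a helper name I made up for illustration, the disk check assumes the install will live on the `/` filesystem, and RAM is left out because reading it differs between Linux and Mac OS X.

```shell
# Sketch of a pre-flight check against the minimum requirements.
# check_min is a hypothetical helper, not part of StreamSets.

# Print OK or FAIL depending on whether an actual value meets a minimum.
check_min() {
    local label="$1" actual="$2" required="$3"
    if [ "$actual" -ge "$required" ]; then
        echo "OK   $label: $actual (minimum $required)"
    else
        echo "FAIL $label: $actual (minimum $required)"
        return 1
    fi
}

# Number of online CPU cores (fall back to nproc if getconf is unavailable).
cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null || nproc)

# Free space in whole GB on "/" (assumed install location).
free_kb=$(df -Pk / | awk 'NR==2 {print $4}')
free_gb=$((free_kb / 1024 / 1024))

check_min "cores" "$cores" 2 || echo "  -> add CPU cores before installing"
check_min "free disk (GB)" "$free_gb" 6 || echo "  -> free up disk space before installing"
```

The file descriptor and Java requirements are handled in the prerequisite steps that follow, so they are not repeated here.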

Prerequisites for installing StreamSets Data Collector

  • Create the sdc service account
  • Add the sdc user account to the sdc group
  • Install and configure Java
  • Set the nofile limit to 32768 or higher

# groupadd sdc
# useradd -r -g sdc -d /home/sdc -s /sbin/nologin sdc

# apt-get install openjdk-8-jdk  ## On Ubuntu
# yum install java-1.8.0-openjdk  ## On Red Hat / CentOS / Fedora
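
Since Data Collector 3.3.0 needs Java 8, it is worth confirming which version actually ended up on the PATH. Here is a small sketch; `java_major` is a hypothetical helper name, and it assumes the first line of `java -version` carries the quoted version string.

```shell
# Extract the major Java version from a version string.
# Handles both the legacy "1.8.0_171" scheme and the newer "11.0.2" scheme.
java_major() {
    local v="$1"
    case "$v" in
        1.*) v="${v#1.}" ;;   # legacy scheme: drop the leading "1."
    esac
    echo "${v%%[._-]*}"       # keep everything before the first separator
}

# `java -version` writes to stderr; the version is the quoted token on line 1.
if command -v java >/dev/null 2>&1; then
    raw=$(java -version 2>&1 | awk -F'"' 'NR==1 {print $2}')
    echo "Detected Java major version: $(java_major "$raw")"
else
    echo "java not found on PATH"
fi
```

If the detected version is not 8, install the Java 8 package for your distribution before continuing.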

Edit the file below and add these entries to raise the open-file (nofile) limit

# vi /etc/security/limits.conf
* soft nofile 65536
* hard nofile 65536

After setting the limits, reboot the Linux machine for the changes to take effect. After the reboot, verify the new limit with the command below

# ulimit -n
65536
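
The same verification can be scripted. This is a sketch only: `nofile_ok` is a hypothetical helper, and it checks the limit of the shell you run it in, which may differ from what the sdc account receives.

```shell
# Return success if a `ulimit -n` value meets the minimum (32768 by default).
nofile_ok() {
    local limit="$1" minimum="${2:-32768}"
    [ "$limit" = "unlimited" ] && return 0
    [ "$limit" -ge "$minimum" ]
}

current=$(ulimit -n)
if nofile_ok "$current"; then
    echo "nofile limit $current is sufficient"
else
    echo "nofile limit $current is below 32768; recheck /etc/security/limits.conf"
fi
```

With the entries shown above, the script should report 65536 as sufficient once you have logged in again after the reboot.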

StreamSets Data Collector Installation Using tarball

The very first step is to create a directory for the StreamSets Data Collector download on a filesystem with enough free space. In this example I created the directory directly under “/”

# mkdir /sdc
# cd /sdc/
# wget https://archives.streamsets.com/datacollector/3.3.0/tarball/streamsets-datacollector-all-3.3.0.tgz
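
If you plan to install a different release, the download link can be built from the version number. This sketch assumes other releases follow the same URL pattern as the 3.3.0 link above, which I have not verified for every version; `sdc_tarball_url` is a hypothetical helper name.

```shell
# Build the tarball URL for a given version, following the 3.3.0 link pattern.
sdc_tarball_url() {
    local version="$1"
    echo "https://archives.streamsets.com/datacollector/${version}/tarball/streamsets-datacollector-all-${version}.tgz"
}

SDC_VERSION="3.3.0"
url=$(sdc_tarball_url "$SDC_VERSION")

# wget -c resumes a partial download instead of starting over.
echo "Download with: wget -c $url"
```

Resuming with `-c` is handy here because the package is large and an interrupted transfer would otherwise restart from zero.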

Extract the downloaded StreamSets tarball

You can extract the compressed file with the tar command. The package is almost 3.7GB in size, so make sure you have good internet bandwidth for the download.

# tar -xvf streamsets-datacollector-all-3.3.0.tgz
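
With a download this large, a truncated file is a common cause of extraction errors. The sketch below lists the archive from start to finish without extracting it, which fails on a corrupt or incomplete file; `tarball_ok` is a hypothetical helper name.

```shell
# Return success if the gzip-compressed tarball can be listed end to end.
tarball_ok() {
    tar -tzf "$1" >/dev/null 2>&1
}

archive="streamsets-datacollector-all-3.3.0.tgz"
if [ -f "$archive" ]; then
    if tarball_ok "$archive"; then
        echo "$archive looks intact"
    else
        echo "$archive is corrupt or incomplete; download it again"
    fi
else
    echo "$archive not found in $(pwd)"
fi
```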

After uncompressing the package, copy the init script prototype from the initd directory to /etc/init.d so that the SDC service can be started and stopped

# cd streamsets-datacollector-3.3.0/
# cd initd/
# cp _sdcinitd_prototype /etc/init.d/sdc

Change the ownership of the init script to the sdc user account so that the application has the permissions it needs. Then edit the script and set both SDC_DIST and SDC_HOME to the location of the extracted directory

# chown sdc:sdc /etc/init.d/sdc

# vi /etc/init.d/sdc

# installation directory of the data collector IT MUST BE SET
export SDC_DIST="/sdc/streamsets-datacollector-3.3.0"
export SDC_HOME="/sdc/streamsets-datacollector-3.3.0"
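
If you prefer not to edit the script by hand, the two variables can be set with sed. This is a sketch under stated assumptions: `set_sdc_paths` is a hypothetical helper, it expects both variables on lines starting with `export SDC_DIST=` and `export SDC_HOME=`, and the directory must not contain `|`, `&` or backslash characters.

```shell
# Point SDC_DIST and SDC_HOME in an init script at the given install directory.
set_sdc_paths() {
    local script="$1" dir="$2" tmp
    tmp=$(mktemp) || return 1
    sed -e "s|^export SDC_DIST=.*|export SDC_DIST=\"$dir\"|" \
        -e "s|^export SDC_HOME=.*|export SDC_HOME=\"$dir\"|" \
        "$script" > "$tmp" || { rm -f "$tmp"; return 1; }
    cat "$tmp" > "$script"   # overwrite in place to keep owner and mode
    rm -f "$tmp"
}

# Example (run as root after copying the prototype to /etc/init.d/sdc):
#   set_sdc_paths /etc/init.d/sdc /sdc/streamsets-datacollector-3.3.0
```

Afterwards, `grep '^export SDC_' /etc/init.d/sdc` shows whether both lines were updated.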

chmod 755 /etc/init.d/sdc  ##Provide Executable permissions

mkdir /etc/sdc
cd /sdc/streamsets-datacollector-3.3.0/
cp -R etc/* /etc/sdc/
chown -R sdc:sdc /etc/sdc
chmod go-rwx /etc/sdc/form-realm.properties

# Create Log Directory path to write application logs
mkdir /var/log/sdc
chown sdc:sdc /var/log/sdc

# Library Directory path to keep library files
mkdir /var/lib/sdc
chown sdc:sdc /var/lib/sdc

# Resources directory path
mkdir /var/lib/sdc-resources
chown sdc:sdc /var/lib/sdc-resources
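
The directory steps above can also be wrapped in one loop. This is a sketch, not a replacement for the commands: `create_sdc_dirs` is a hypothetical helper, the optional prefix exists only so you can try it in a scratch location, and copying the configuration files and tightening form-realm.properties still have to be done as shown above.

```shell
# Create the config, log, data and resources directories under an optional
# prefix (empty for the real locations), then hand them to the sdc account.
create_sdc_dirs() {
    local prefix="${1:-}" d
    for d in /etc/sdc /var/log/sdc /var/lib/sdc /var/lib/sdc-resources; do
        mkdir -p "${prefix}${d}" || return 1
        # chown needs root and an existing sdc account; skip it otherwise.
        if [ "$(id -u)" -eq 0 ] && id sdc >/dev/null 2>&1; then
            chown sdc:sdc "${prefix}${d}" || return 1
        fi
    done
}

# Dry run against a scratch directory instead of the real system paths.
scratch=$(mktemp -d)
create_sdc_dirs "$scratch" && find "$scratch" -type d | sort
rm -rf "$scratch"
```

Calling `create_sdc_dirs` with no argument as root creates the real directories.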

Start and Stop the StreamSets Data Collector Service

# /etc/init.d/sdc start
# update-rc.d sdc defaults 97 03

StreamSets Data Collector log file

# /etc/init.d/sdc status
/sdc/streamsets-datacollector-3.3.0/libexec/sdcd-env.sh: line 83: export: `/sdc/streamsets-datacollector-3.3.0/streamsets-libs-extras/': not a valid identifier
running

Open a browser and go to the following URL, replacing IPADDRESS with the address of your server

http://IPADDRESS:18630
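
The service can take a little while to answer after it starts. The sketch below polls the UI address until it responds; `sdc_url` and `wait_for_url` are hypothetical helpers, curl is assumed to be installed, and the retry count and delay are arbitrary defaults.

```shell
# Build the UI address for a host; 18630 is the port used in the URL above.
sdc_url() {
    echo "http://$1:${2:-18630}"
}

# Poll a URL until it answers or the attempts run out.
wait_for_url() {
    local url="$1" attempts="${2:-30}" delay="${3:-2}" i=0
    while [ "$i" -lt "$attempts" ]; do
        curl -fsS -o /dev/null "$url" 2>/dev/null && return 0
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}

# Example: wait_for_url "$(sdc_url localhost)" && echo "Data Collector UI is up"
```

If the call times out, look at the files under /var/log/sdc before retrying.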

StreamSets Data Collector default admin user password

Sample data collection output

The StreamSets Data Collector installation is now complete.

Thanks for your wonderful Support and Encouragement

Ravi Kumar Ankam

My Name is ARK. Expert in grasping any new technology, Interested in Sharing the knowledge. Learn more & Earn More

1 Response

  1. Igor says:

    For some reason when I try to start init.d/sdc start It asks me for the password of a sdc user, which is not described in the guide. Could be solved by setting password to user, but what if I don’t want to do it?
