Responsibilities

Administration and Operation

  • Administration and maintenance of Linux servers (CentOS / RHEL based)
  • Managing backups and archival
  • System planning and extension and involvement in the acquisition of new components
  • Automating systems administration tasks utilizing open-source configuration management tools
  • Providing technical support and user support as needed
  • Corrective, progressive and preventive maintenance of the ODySSEy Data Platform, including hardware and software
  • Processing and analysis of anomalies
  • Understanding and analysis of users’ needs to tune configurations
  • System Monitoring, troubleshooting and resolution procedures, RACI, SLAs
  • Building consensus and soliciting input when making significant changes, and maintaining good channels of communication regarding decisions and policies associated with the delivery of ODySSEy services
  • Implementing time-sensitive tasks under minimal supervision
  • Planning and coordinating the patching schedule for all components of ODySSEy, including software and firmware updates, through effective vendor management
  • Assisting users with experiment and application setup using a variety of development, performance analysis, and hardware configuration tools
  • User account management

Troubleshooting

  • Performing problem determination and proposing resolution to maintain and administer the ODySSEy resources used to support the research mission

Others

  • Working with other groups, such as Vendor(s), Infrastructure, Supercomputing, and Research Team, to support the continued and efficient operation of the ODySSEy resources
  • Maintaining relationships with multiple hardware and software vendors
  • Maintaining operational policies, procedures and practices necessary for reliable delivery of ODySSEy services
  • Researching and staying abreast of recent trends in software and hardware for data science use
  • Participation in meetings
  • Performing other duties as assigned

Requirements

  • Bachelor's/Master's degree in Computer Science or another scientific domain, with professional experience (minimum 4 years) in HPC or Data Science Platform operations
  • Advanced knowledge and proven ability to design, install, and maintain large-scale Linux-based Data Science systems
  • Good understanding of HPC programming and scripting in R, Python etc.
  • Very good knowledge of HPC-related hardware technologies, including containerization etc.
  • Advanced Linux knowledge
  • Thorough knowledge of security access and network protocols such as TCP/IP, DHCP, DNS, NFS and VPN
  • Thorough knowledge of using Microsoft Active Directory and binding it to Linux Operating Systems
  • Ability to recommend processes (e.g. automation) that enhance the utility and operation of computer systems
  • Ability to effectively troubleshoot minor to complex operating system, software and hardware failures and take appropriate corrective actions and/or develop sound workarounds
  • Ability to research problem solutions and maintain knowledge of current technologies
  • Open-minded, flexible and able to think across disciplines
  • Strong end user service skills, team skills, and the ability to collaborate within cross-functional teams
  • Creative and autonomous, with proven organizational and communication skills
  • Organized and rigorous, with a sense of responsibility
  • Results- and service-oriented
  • Ability to analyse, organize, prioritize and multitask across support and operations tasks