Generative AI

Technologies for Embedding AI
Generative AI is one of the most prominent fields in artificial intelligence today. The rapid pace of evolution is steadily broadening the scope of applications, ranging from business operations to advanced research. At SANEI HYTECHS, our R&D does not view Generative AI as a standalone tool. Instead, we focus on its implementation and operation as a core component of larger systems. We investigate optimal implementation methods for various use cases, prioritizing the practicality and reliability of the overall system while ensuring seamless integration with existing business processes. Below, we introduce our initiatives as an engineering firm, using the system implementation of Large Language Models (LLMs) as an example.

Leveraging Large Language Models (LLMs)

What is an LLM?

Large Language Models (LLMs) are the flagship technology of Generative AI. An LLM takes natural language text as input and performs tasks such as text generation, summarization, and classification.

In recent years, models with trillions of parameters have emerged, requiring significant computational resources for training and operation. Consequently, it has become common practice to utilize publicly available foundation models or APIs, adjusting and extending them for specific applications.
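As a rough illustration of this practice (not a description of any specific project), the sketch below runs a publicly available pretrained model through the Hugging Face transformers library; the model name "gpt2" is only a small placeholder, and any comparable open model or hosted API could take its place.

```python
# Minimal sketch: running a publicly available foundation model locally.
# The model name is a placeholder; swap in whichever open model or API
# endpoint a given project requires.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
result = generator("Summarize the benefits of code review in one sentence.",
                   max_new_tokens=64)
print(result[0]["generated_text"])
```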

Domain-Specific Optimization via RAG (Retrieval-Augmented Generation)

Retrieval-Augmented Generation (RAG) is a key technique for adapting LLM-based systems to specific domains. RAG improves output accuracy by retrieving documents relevant to the input text and providing them to the LLM as additional context.

This is highly effective for business systems and specialized fields, as it allows domain-specific knowledge to be supplied as external data. Customizing the design for each use case is essential to meeting diverse business requirements.
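As a simplified illustration of the basic RAG flow (not any particular production design), the sketch below retrieves the documents most similar to a query and folds them into the prompt. The sample documents are invented, and TF-IDF retrieval is a stand-in; embedding-based search is a common alternative.

```python
# Minimal RAG sketch: retrieve the documents most relevant to a query
# and prepend them to the prompt as additional context.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Invented example documents standing in for a domain knowledge base.
documents = [
    "Design rule A: analog blocks require a dedicated power domain.",
    "Design rule B: all test benches must log coverage metrics.",
    "Release note: firmware v2.1 adds CAN-FD support.",
]

def retrieve(query: str, top_k: int = 2) -> list[str]:
    """Return the top_k documents most similar to the query (TF-IDF + cosine)."""
    vectorizer = TfidfVectorizer()
    vectors = vectorizer.fit_transform(documents + [query])
    scores = cosine_similarity(vectors[len(documents)], vectors[:len(documents)]).ravel()
    ranked = scores.argsort()[::-1][:top_k]
    return [documents[i] for i in ranked]

def build_prompt(query: str) -> str:
    """Combine the retrieved context and the user query into one prompt."""
    context = "\n".join(retrieve(query))
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"

print(build_prompt("Which firmware version supports CAN-FD?"))
```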

System Implementation Strategies for LLMs

As an example of our RAG implementation, we built a configuration that runs the open-source "Llama" model in a cloud-based execution environment.

By utilizing a VPN connection, we established a verification environment in a short period while meeting security requirements equivalent to on-premises systems. Our modular design allows the LLM execution component to be swapped easily, facilitating comparative studies with paid APIs.
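The sketch below illustrates the kind of interface that makes such swapping straightforward; the class and method names are hypothetical and stand in for whatever local runtime or commercial API a comparison covers.

```python
# Hypothetical interface for a swappable LLM execution component.
# Only the backend classes change when moving between a locally hosted
# open-source model and a paid API.
from abc import ABC, abstractmethod

class LLMBackend(ABC):
    @abstractmethod
    def generate(self, prompt: str) -> str:
        """Return the model's completion for the given prompt."""

class LocalLlamaBackend(LLMBackend):
    def generate(self, prompt: str) -> str:
        # Call a locally hosted Llama model here (e.g. via an inference server).
        raise NotImplementedError

class PaidAPIBackend(LLMBackend):
    def generate(self, prompt: str) -> str:
        # Call a commercial API here; credentials and endpoints stay inside this class.
        raise NotImplementedError

def answer(backend: LLMBackend, prompt: str) -> str:
    # Application code depends only on the interface, so backends can be
    # swapped for side-by-side comparison.
    return backend.generate(prompt)
```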

Furthermore, integrating the LLM directly into the system enables sophisticated I/O control that extends beyond basic prompt engineering. For instance, by internally managing the history of inputs to the model, we can maintain consistent output formats even when using relatively small-scale models. We are accumulating these implementation insights to advance our research across various system applications.
Example of RAG Configuration and Data Flow
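The following sketch shows one way such input-history management could look: a fixed format instruction plus the most recent turns are re-sent on every call, so even a small model keeps producing output in the expected structure. The format rule, class, and callable are illustrative assumptions rather than our actual implementation.

```python
# Illustrative sketch of managing the model's input history so the output
# format stays consistent across turns. `llm` is any callable mapping a
# prompt string to a completion string (an assumption for this sketch).
from typing import Callable

FORMAT_RULE = "Reply as JSON with the keys 'summary' and 'action_items'."

class ChatSession:
    def __init__(self, llm: Callable[[str], str], max_turns: int = 4):
        self.llm = llm
        self.max_turns = max_turns
        self.history: list[tuple[str, str]] = []  # (user, assistant) pairs

    def ask(self, user_input: str) -> str:
        # Keep only recent turns to bound the prompt length for small models.
        recent = self.history[-self.max_turns:]
        lines = [FORMAT_RULE]
        for user, assistant in recent:
            lines.append(f"User: {user}\nAssistant: {assistant}")
        lines.append(f"User: {user_input}\nAssistant:")
        reply = self.llm("\n".join(lines))
        self.history.append((user_input, reply))
        return reply
```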

Verification and Improvement through Agile Development

We employ an agile development methodology for implementation. By building web applications with libraries such as "Gradio," we have iteratively improved functional requirements and usability based on continuous testing and feedback.
Chatbot web application UI (built with Gradio)
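A minimal version of such a prototype might look like the sketch below; the response function is a placeholder for the actual RAG pipeline, and the title is arbitrary.

```python
# Minimal Gradio chatbot sketch. The response function is a placeholder;
# in a real prototype it would call the RAG pipeline described above.
import gradio as gr

def respond(message, history):
    # Placeholder answer; replace with an LLM / RAG call.
    return f"(model output for: {message})"

demo = gr.ChatInterface(fn=respond, title="RAG Chatbot Prototype")

if __name__ == "__main__":
    demo.launch()
```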

Harnessing Multimodal Models

What are Multimodal Models?

Recent language models can handle multiple types of data as input, such as images and audio, in addition to text. These are known as multimodal models. Among these, Vision Language Models (VLMs), which can process images alongside text, are a primary research focus for us due to their high compatibility with our existing image processing technologies.

Scene Classification Using Dashcam Footage

As an example of our research, we present scene classification using driving data from dashcams. We target distinctive scenes that require caution while driving, such as intersections, poor weather, or passing emergency vehicles. When classifying frames containing these scenes, traditional methods can struggle because pixel-based features such as brightness or hue show little variation between consecutive frames.

To address this, we employed a method in which the VLM describes each frame in text and that output is used as a feature set. We use "Asagi," a Japanese-compatible VLM, for feature extraction, followed by "Llama" to standardize the text format. By applying agglomerative clustering to the resulting descriptive features, we successfully extracted clusters containing specific conditions, such as scenes with traffic lights or heavy vehicles. Treating language models as system components to be combined in stages allows us to build sophisticated, multi-layered architectures.
Scene classification results via a standalone application
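The sketch below mimics the staged structure of this pipeline under simplifying assumptions: hard-coded sentences stand in for the Asagi/Llama text output, and TF-IDF vectors plus scikit-learn's agglomerative clustering stand in for the actual feature extraction and clustering setup.

```python
# Simplified sketch of the staged pipeline: frame descriptions (here
# hard-coded stand-ins for VLM output) are vectorized and grouped with
# agglomerative clustering.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AgglomerativeClustering

frame_descriptions = [
    "An intersection with a red traffic light and pedestrians crossing.",
    "A traffic light ahead turns green on a wide urban road.",
    "A large truck occupies the lane directly in front of the car.",
    "Heavy rain reduces visibility on the highway.",
]

# Turn each description into a feature vector (TF-IDF as a stand-in for
# the descriptive features used in the actual study).
features = TfidfVectorizer().fit_transform(frame_descriptions).toarray()

# Group similar scenes; the number of clusters is chosen for illustration.
labels = AgglomerativeClustering(n_clusters=2).fit_predict(features)
for description, label in zip(frame_descriptions, labels):
    print(label, description)
```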

Understanding and Exploring Internal Architectures

Challenges in Language Model Implementation

In systems that integrate language models, model size and data volume are critical design factors. Implementing large-scale models requires investigating parameter-reduction techniques such as pruning, distillation, and quantization. At the same time, managing input and output data poses a challenge: because LLMs generate text through a loop in which each output becomes part of the next input, design considerations must extend beyond the model itself.
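The loop in question can be pictured as below; `model_step` is only a placeholder for a single forward pass, and the point is that each output token is fed back as part of the next input, so buffering and I/O handling sit outside the model.

```python
# Schematic of the autoregressive loop: each generated token is appended
# to the input sequence for the next step.
def model_step(token_ids: list[int]) -> int:
    # Placeholder: a real implementation runs the Transformer here.
    return (sum(token_ids) + 1) % 50000

def generate(prompt_ids: list[int], max_new_tokens: int, eos_id: int = 0) -> list[int]:
    tokens = list(prompt_ids)
    for _ in range(max_new_tokens):
        next_id = model_step(tokens)   # output of this step...
        tokens.append(next_id)         # ...becomes part of the next input
        if next_id == eos_id:
            break
    return tokens
```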

Transformer Architecture and Embedded Applications

The core of an LLM is the Transformer layer. Since matrix multiplication accounts for most of the computation in this layer, the processing structure is relatively simple despite the high computational load. However, some acceleration techniques in the Attention mechanism, such as KV caching, increase memory usage in exchange for reduced computation. Applying these directly to edge devices can create bottlenecks in memory bandwidth and data transfer. Based on these constraints unique to embedded environments, SANEI HYTECHS continues to research the implementation of language models on various devices.
Computational graph of an attention module with KV caching (Visualized with Netron)
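As a simplified, single-head NumPy sketch of this trade-off: keys and values for all previous tokens are kept in memory so each new step only computes projections for the newest token, which reduces computation but grows the cache that an edge device must store and move.

```python
# Single-head attention with a KV cache (illustrative, not optimized).
import numpy as np

class KVCacheAttention:
    def __init__(self, d_model: int):
        rng = np.random.default_rng(0)
        self.Wq = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
        self.Wk = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
        self.Wv = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
        self.k_cache = []  # grows by one entry per generated token
        self.v_cache = []

    def step(self, x: np.ndarray) -> np.ndarray:
        """Attend from the newest token over all cached positions."""
        q = x @ self.Wq
        self.k_cache.append(x @ self.Wk)   # cached instead of recomputed
        self.v_cache.append(x @ self.Wv)
        K = np.stack(self.k_cache)         # (seq_len, d_model), held in memory
        V = np.stack(self.v_cache)
        scores = K @ q / np.sqrt(q.shape[-1])
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        return weights @ V

attn = KVCacheAttention(d_model=8)
out = attn.step(np.ones(8))  # first token; the cache now holds one K/V pair
```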