October 28-29, 2024 | Tokyo, Japan
Note: The schedule is subject to change.

The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for Open Source Summit + AI_dev Japan 2024 to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.

Monday October 28, 2024 12:05 - 12:45 JST
Every conversation on AI starts with models and ends with data. Data preparation is emerging as a very important phase of the GenAI journey, because large, high-quality text and code corpora have been shown to play a crucial role in producing high-performing Large Language Models (LLMs). The data preparation phase of the Generative AI lifecycle cleans, filters, and transforms text and code datasets acquired from various sources into a tokenized form suitable for LLM training, whether that is pre-training or building LLM applications through fine-tuning or instruction tuning. The latter poses unique challenges, as each use case may require a tailored data preparation approach. Given the enduring and evolving demand for data preparation techniques in LLM applications, we are introducing Data Prep Kit (DPK) as an open-source software asset. This effort aims to foster collaboration within the community, enable collective development and use, and ultimately reduce time to value. DPK has been instrumental in powering the IBM open-source Granite models.
Speakers
Takuya Goto

Software Engineer, IBM
Takuya is a software engineer at IBM, where he works on software product development and open-source development. Takuya specializes in NLP, ML, and text-based data processing. In his free time, Takuya likes running and traveling with his wife and son.
Daiki Tsuzuku

Software Developer, IBM
I have worked at IBM as a software developer for about 7 years. I have been a backend developer, and sometimes a frontend developer, for Watson Explorer, Watson Discovery, and watsonx Orchestrate. My field is developing applications for processing a wide variety and large volume...
Hall B (4)
