Updates - October 17, 2024
--------------------------

| Updates
| Project management
| Data gathering
|   ElementVape completed
| Data cleaning and prep
|   Starting to look at some data cleaning/prep processes, otherwise no
    updates
| NLP
|   Progress on TFN
| Image classification


NLP Updates
~~~~~~~~~~~

Made progress on TFN


TFN/CBD Samples
~~~~~~~~~~~~~~~

.. image:: images/1017_1.png
   :alt: vapes tfn samples
   :width: 100%
   :align: left

Computer Vision Model Updates
~~~~~~~~~~

| Working on using a pre-labeled vape dataset to clean our images and
  filter out non-vapes.
| Labeled and cleaned a decently sized dataset for screens (~9,000
  examples) from each of the websites. Working on fine-tuning a model on
  this data.
| We are making some changes to our processing based on feedback from
  CDCF (separating out ICED vs. NON-ICED varieties of the same product).
| Overall, this is a more straightforward classification than screens, so
  it shouldn't be too much of a change.


Image Cleaning
~~~~~~~~~~~~~~

| Found a public dataset of vape images with bounding box ground truth,
  about 2100 images.
| Trained a YOLOv8 model using 80% of the data for training and 20% for
  testing.
| Preliminary performance, without any parameter tuning, is about 82%
  accuracy in detecting vapes in images.
| Working on improving this accuracy by increasing augmentation and
  potentially adding more data points.
| The goal is to use this model to filter all of the web-scraped
  images, eliminating those that do not contain vapes, as a
  pre-processing step for the VLM.
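
The split and filtering steps above can be sketched as follows. This is a
minimal, standard-library illustration, not the project's actual code:
``split_dataset``, ``filter_vape_images``, the ``detect`` callable, and the
0.5 confidence threshold are all hypothetical names and values. In practice
``detect`` would wrap the trained YOLOv8 model and return the highest vape
confidence found in an image.

.. code-block:: python

   import random
   from typing import Callable, List, Sequence, Tuple


   def split_dataset(
       items: Sequence[str], train_fraction: float = 0.8, seed: int = 0
   ) -> Tuple[List[str], List[str]]:
       """Shuffle items reproducibly and split them into train and test lists."""
       shuffled = list(items)
       random.Random(seed).shuffle(shuffled)
       cut = int(len(shuffled) * train_fraction)
       return shuffled[:cut], shuffled[cut:]


   def filter_vape_images(
       paths: Sequence[str],
       detect: Callable[[str], float],
       threshold: float = 0.5,
   ) -> Tuple[List[str], List[str]]:
       """Separate images into (kept, dropped) by detector confidence.

       `detect` is a hypothetical stand-in for the trained detector: it takes
       an image path and returns the highest vape confidence (0.0 if none).
       """
       kept, dropped = [], []
       for path in paths:
           (kept if detect(path) >= threshold else dropped).append(path)
       return kept, dropped


   # Placeholder file names, sized like the ~2,100-image public dataset.
   images = [f"img_{i:04d}.jpg" for i in range(2100)]
   train, test = split_dataset(images)
   print(len(train), len(test))  # 1680 420

Keeping the detector behind a plain callable means the threshold can be
tuned, or the model swapped, without touching the filtering logic.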


Prediction Examples
~~~~~~~~~~~~~~~~~~~

.. image:: images/1017_2.png
   :alt: prediction examples
   :width: 100%
   :align: left

Background: Vision-Language Models
~~~~~~~~~~

| Some vape data has a text component, a vision component, or both
| E.g., iced flavors, presence of screens, etc.
| Recent models (e.g., LLaVA, Chameleon) can ingest interleaved text and
  images
| They consist of an LLM backbone and a vision encoder/tokenizer


VLMs are strong zero-shot learners
~~~~~~~~~~


| Recent VLM research has focused on zero- and few-shot performance on
  various tasks
| E.g., some VLMs can answer questions about images despite never being
  trained to do so
| Pros: VLMs are very adaptive to novel tasks. We can take advantage of
  this to label data
| Cons: This can be inefficient, unreliable, and difficult to verify.
| Performance is highly dependent on the choice of prompt
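
One hedged way to quantify prompt sensitivity is to ask the same yes/no
question with several phrasings and measure how often the answers agree.
The sketch below assumes the model's answers have already been collected;
the prompt wordings and answers are made-up placeholders, not project
results, and ``agreement_rate`` is a hypothetical helper.

.. code-block:: python

   from typing import Dict, List


   def agreement_rate(answers_by_prompt: Dict[str, List[str]]) -> float:
       """Fraction of images on which every prompt variant gave the same answer.

       Each list holds one answer per image, in the same image order.
       """
       per_image = list(zip(*answers_by_prompt.values()))
       if not per_image:
           return 0.0
       unanimous = sum(1 for answers in per_image if len(set(answers)) == 1)
       return unanimous / len(per_image)


   # Placeholder answers for four images under three prompt phrasings.
   answers = {
       "Does this vape have a screen?": ["yes", "no", "yes", "no"],
       "Is a display visible on the device?": ["yes", "no", "no", "no"],
       "Answer yes or no: is there a screen?": ["yes", "no", "yes", "no"],
   }
   print(agreement_rate(answers))  # 0.75

Images where the phrasings disagree are natural candidates for manual
review.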

Our approach:
~~~~~~~~~~~~~


| Design prompts for the variables we are interested in (for now screens
  and iced)
| Use LLaVA to label a portion of the data (~10,000 examples)
| Clean the labeled data for inaccuracies, which is much faster than
  manual labeling because LLaVA does a decent job and its errors are
  predictable
| Fine-tune another VLM (for now, FLAVA) on this clean data to achieve
  more reliable performance
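
The prompt design and cleaning steps above can be sketched as follows. This
is an illustration under stated assumptions, not the project's pipeline: the
prompt wording, ``parse_yes_no``, and ``label_example`` are hypothetical, and
the raw answers stand in for text returned by the labeling model.

.. code-block:: python

   from typing import Dict, Optional

   # Hypothetical prompt wording for the two variables of interest.
   PROMPTS: Dict[str, str] = {
       "screen": "Does the device in this image have a screen? Answer yes or no.",
       "iced": "Is this product an iced flavor? Answer yes or no.",
   }


   def parse_yes_no(raw_answer: str) -> Optional[bool]:
       """Map free-text model output to True/False, or None if ambiguous."""
       words = raw_answer.lower().replace(".", " ").replace(",", " ").split()
       has_yes, has_no = "yes" in words, "no" in words
       if has_yes == has_no:  # neither or both present: needs manual review
           return None
       return has_yes


   def label_example(raw_answers: Dict[str, str]) -> Dict[str, Optional[bool]]:
       """Parse one example's raw answers, keyed by variable name."""
       return {name: parse_yes_no(raw_answers.get(name, "")) for name in PROMPTS}


   example = label_example(
       {"screen": "Yes, there is a display.", "iced": "Unclear."}
   )
   print(example)  # {'screen': True, 'iced': None}

Returning ``None`` for ambiguous answers keeps them out of the fine-tuning
set until a person has checked them.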

FLAVA is a VLM that can perform both multimodal and unimodal
vision/language tasks.

.. image:: images/1017_3.png
   :alt: vapes with screens
   :width: 100%
   :align: left


Potential options:
~~~~~~~~~~~~~~~~~~


| We can train the model and use it to label the vape data
| We can also deploy the model, allowing CDC groups to query it via an
  API without our involvement

Hugging Face provides a free inference tool that we can develop and share
with the CDC.