Updates - November 14, 2024

Updates
Project management
Data gathering and prep
NLP updates
CV updates

Data gathering and preparation

Planning to re-run the scraping to get categories/parent links from the sites
Should be done before next time
Also working on unifying all the data into one model
Previously only done for mipod

NLP Updates

Iteratively improved PRODUCT TYPE classification (72% -> 78%) on sample test set
Main issues are disposable systems being misidentified–planning to emphasize “disposable” in future iterations
Flavors
Some sites provide structured flavors, and some do not
Currently working on assessing those and determining best path forward
vape.com: Set up LLM functionality for parsing due to format variability
vapewh: Set up regex to parse as format is more consistent

NLP Updates SCREEN variable

Initial scan with GPT to determine patterns around screen and displays
Vapewh, vapesourcing, perfect vape, cs vape, vaping.com, vape.com, my vapor store, getpop
Found several patterns
Display types (LED, LCD, OLED, TFT)
Color/colorful displays
Touch screens
Curved screens
Battery and E-liquid indicators
“Smart” displays
Digital and HD displays
Animated
Backlit and Glow-in-the-dark

NLP Updates SCREEN variable

Should be able to be identified with regular expressions now that we have the patterns.
Initial regular expressions are built, need to test on larger subset.
Need to isolate which features are most important and/or if any exist.
Examples:
This vape has a colorful touch screen and a battery level indicator.
color_display: [‘colorful’]
touch_screen: [‘touch screen’]
battery_indicator: [‘battery level’]
Features a 3D curved OLED display with e-juice indicators.
display_type: [‘OLED’]
curved_screen: [‘3D curved’]
eliquid_indicator: [‘e-juice’]
High-definition LED smart display screen with animation.
display_type: [‘LED’]
smart_display: [‘smart’]
hd_display: [‘High-definition’]
animated: [‘animation’]

Computer Vision Updates

YOLO Pre-Processing Updates

Last meeting we showed the results on the My Vape Store data with and without tuning on a part of the dataset.
This week we incorporated the entire My Vape Store dataset into the model to further improve its performance.
To validate its performance, we labeled theCSVape data and used it as an unseen test set.
This dataset contained 436 total images with 376 of these containing NO vapes while the remaining 60 contained at least a single vape.
data gathering stats

YOLO Pre-Processing Updates Contd

When used for testing the model correctly classified 427 of the 436 images with 8 false negatives (missed vape images) and 7 false positives (image misclassified as having a vape).
This translates to an image classification accuracy of 97.9%.
For the final model we have added all theCSVape images to the full dataset and retrained.
This model will be used to classify all images from 9 different websites that were scraped by the team to identify images with vapes for use by the VLM.

Deployment Options for CV Models

Model Deployment Background

After training a model, how do we make it useable?
Model deployment the process of exposing the trained model to end-users for inference
Considerations:
Privacy
Computational resources
Number of requests and latency

Option 1: Push/Pull to/from Huggingface Hub

Pros:
Free! All compute is done locally
Privacy can make model visibility private
Flexibility model can be tuned/changed if necessary
Complexity model requires no management
Cons:
Compute must have access to GPU for reasonable inference speeds
Maintenance must write your own scripts to prepare data and feed to model

Option 2: Cloud Services (e.g: AWS Sagemaker)

Pros:
Simplicity model can be queried with a simple script like any API
Compute can leverage cloud providers resources
Cons:
Flexibility model is fixed. If changes are needed, must make them (potentially re-training) and re-deploy
Cost must pay for compute/memory resources
Overkill unnecessary amount of infrastructure for only a small number of requests

Option 3: Huggingface Inference Endpoints

Pros:
Simplicity of HF + flexibility of cloud
HF manages cloud services as needed only pay for what you use
Built to interface with huggingface models (where our model is built)
Cons
Model still lives in cloud must re-deploy to make changes
Paid (relatively cheap, $0.5 per GPU hour)

What do we recommend:

Inference endpoints are a simpler and more manageable version of AWS/GCP/Azure
If you want flexibility and are comfortable writing code to run the model, the first option is likely best
If you prefer a simpler API, inference endpoints are a better option