Updates - December 12, 2024 -------------- | Updates | Project management | Data gathering and prep | Unified data set is complete and will be part of deliverable | Will update new data in this format | CV updates | NLP updates | Deliverable updates YOLO Pre-Processing Updates ~~~~~~~~~~ | Last meeting we discussed a final model for vape detection that was trained on 3 separate datasets, 2 of which were web scraped and labeled by GTRI. | All previous models we trained were only trained for vape detection, we decided to add an additional class of vape juice as these contain a lot of information of the brands, flavors, and nicotine content. | We went back through previous data and re-labeled it to include bounding boxes for vape juices, a total of 380 vape juices were annotated and added to the training data. | A new model was then trained for these two classes and used to make predictions on all 9 image datasets that we scraped from the web. | This new class allows for extracting images with vape juices to be passed to a VLM for further processing. .. image:: images/1212_1.png :alt: vape liquids :width: 100% :align: left VLM Examples ~~~~~~~~~~ | I passed the images to the left to NVIDIAs new VLM asking the following question for each. | Please give me the following information about this vape juice in a comma separated list. Brand name of juice, juice flavor, size of the bottle, nicotine content, is it a cooling flavor. | These are the responses I received | Cloud Nurdz, Peach Melon, 100ml, 3mg, Yes | Coastal Clouds, Caramel Brule, 30ML (1.01FL OZ), 3.5%, 35 MG/ML, No | Naked 100, Crisp Menthol, 60ML, 3MG, Yes | VaporLax, cool mint, 30ml, 50mg, Yes VLM In-Context Learning ~~~~~~~~~~ | In context learning involves interacting with the VLM and providing it examples of images and corresponding responses expected. | No re-training or tuning is done in the sense that the weights are not changed. | These examples are provided to model and then unseen images are given to the model and it is asked to provide the same information that it was shown for the examples. | Recent papers have shown that results can be significantly increased by providing the model as few as 40 examples of desired output. | We are currently working on the example prompts and a test set we can use for testing this method. NLP ~~~~~~~~~~ | Last time, we discussed iterative improvement of PRODUCT TYPE classification | Since then, we researched different product types available for vapes | Documentation available for review | Notably, added an Open System class for refillable vape products (not disposable / closed system) | We manually labeled >800 products for reference | Preliminary testing using new LLM prompt shows greater accuracy with revised classes | Also, how should we handle CBD products? | Flavors | Continuing work on vapewh, vape.com | RegEx (pattern matching) can capture majority of flavors, but edge cases causing issues | Working to set up LLM for parsing Gaming Variable ~~~~~~~~~~ | Evaluated vapewh, vapesourcing, perfect_vape, csvape, vapingdotcom, vapedotcom, myvaporstore, getpop | Only found about ~67 items (sent earlier today) | Main themes | Gaming features or actual games | Built-in games, retro games, mini-games, and classic games. | Gaming-inspired animations and game-like elements. | Gaming-Inspired Design and Animations | Animated screens, fidget spinners, TRON design | Reward and Tracking Systems | Reward systems with medals and trophies, puff counters, etc. | We can build NLP to track these but still quite rare .. image:: images/1212_2.png :alt: gaming vape :width: 100% :align: left Consolidating Data ~~~~~~~~~~ | Took scraped data from vapewh, vapesourcing, perfect_vape, csvape, vapingdotcom, vapedotcom, myvaporstore, getpop, elementvape and merged into one dataset | Future datasets can be added with only tweaks to conversion file | Process: | Path through every csv file in data folder | Open every file as a pandas dataframe and concatenate dataframes | Using conversion file, smush matching columns together | Clean up nested lists, lists of nan values, formatting data to be prettier | Working on some final cleaning and code restructuring | Final Products: | Excel file with all products | Excel file of columns to be blended | Script to rerun data blending and cleanup | Documentation on how to use the scripts and how to format incoming data | Documentation on what different columns mean (95 total columns!) Deliverable Update ~~~~~~~~~~ | Sync to github from our private gitlab is set up and will set to run regular updates | Initial code is moved there | Read The Docs setup and configured | Requires additional permissions to run automatically, but can run manually | Team is starting to migrate documentation over to that format | https://cdcf-ecig-clean-and-analysis-project.readthedocs.io/en/latest/ | Wrap up of this part of the contract | Wrapping up work | Will have code, documentation, and any data delivered and sent over e-mail | Will be done by EOM