Web Scraping

Sites

See the scraping directory. The scrapers currently live mostly in Jupyter notebooks and may need cleanup later. Notebooks are available for the following sites:

  • CS Vape

  • Get Pop

  • My Vapor Store

  • Perfect Vape

  • Vape.com

  • Vape Sourcing

  • Vape WH

  • Vaping.com

Some of the original regular-expression functions, developed as a demo, are available here, but we expect all of them to eventually be replaced or migrated to the NLP code section.

Total Items Scraped

Text fields are available in the scraped_data directory. Images are available via Box upon request.

Gathered Data

Site            Items gathered   Images gathered
mipod           1,053            1,036
csvape          621              439
getpop          972              972
myvaporstore    2,056            578
vape.com        5,454            34,589
perfectvape     923              2,835
vapewh          362              12,957
vapesourcing    2,587            34,243
vaping.com      1,202            4,020
ElementVape

Mipod provided by CDCF.
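
The figures above are listed per site only, with no overall total. A minimal sketch for summing them follows. It assumes the counts are copied by hand from the Gathered Data figures rather than read from the scraped_data directory, whose file layout is not described here. The names GATHERED and totals are illustrative and are not part of the project code. ElementVape is left out because no counts are listed for it.

```python
# Per-site (items, images) counts copied by hand from the Gathered Data
# figures above. This is an illustrative structure, not project code.
# ElementVape is omitted because no counts are listed for it.
GATHERED = {
    "mipod": (1053, 1036),
    "csvape": (621, 439),
    "getpop": (972, 972),
    "myvaporstore": (2056, 578),
    "vape.com": (5454, 34589),
    "perfectvape": (923, 2835),
    "vapewh": (362, 12957),
    "vapesourcing": (2587, 34243),
    "vaping.com": (1202, 4020),
}


def totals(gathered):
    """Return (total_items, total_images) summed across all sites."""
    total_items = sum(items for items, _ in gathered.values())
    total_images = sum(images for _, images in gathered.values())
    return total_items, total_images


if __name__ == "__main__":
    total_items, total_images = totals(GATHERED)
    print(f"Items gathered:  {total_items:,}")
    print(f"Images gathered: {total_images:,}")
```

With the counts as listed, this gives 15,230 items and 91,669 images across the nine sites that have figures.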