Join me for an incredible workshop to unlock the full potential of Anti-Ban & Web Scraping! From novice to virtuoso, you'll learn the latest legal techniques for collecting the datasets needed to train AI models.

Highlights

Protection Disclosed
- Overcome fingerprint challenges and anti-bot measures.
- Reverse-engineer protections to understand which signals they track.

Proxy and Browser Farms Adventure
- Discover Scrapoxy, the free and open-source proxy waterfall tailored for web scraping.
- Become an expert in browser farms with Playwright.

This workshop, tailored for intermediate developers, will immerse you in the secret world of anti-bot protections. Basic knowledge of Python and JavaScript is recommended, but don't worry if you're new to it: I'll be here to help you every step of the way.

Preflight Checklist

To simplify installation, I have prepared an Ubuntu virtual machine with Chrome, VSCode, Python, Node.js, Playwright, and all the necessary dependencies. You can download it here: https://bit.ly/scwsfiles

Don't miss this unique opportunity to master these essential skills!

#data #webscraping #workshop #antibot

---

This session is a workshop with progressively challenging exercises, lasting 90 to 180 minutes to fit your schedule. You can preview the workshop here: https://github.com/fabienvauchelles/scraping-workshop

We'll tackle protection measures step by step with proxies, headless browsers and deobfuscation. I developed the website https://trekky-reviews.com specifically for this workshop, featuring the latest techniques used by anti-bot systems.

The ideal attendance size is 30, but I can easily accommodate between 15 and 60 participants.
The best part? Everyone will walk away with actionable skills to legally gather data using these cutting-edge methods.

Alternatively, I offer a 45-minute live-coding session if that's preferred.

Here's a sneak peek of the 2-hour workshop:

1. Introduction (4 mins)
To kick off the workshop, I engage the participants by asking about their experiences with bypassing website protection. This sets the stage for introducing myself and expressing my passion for web scraping and reverse-engineering anti-bot measures.

2. Legal (4 mins)
Let's take a proactive approach. Here's a straightforward decision pathway: if the data is public and non-personal, you don't need to agree to any terms and conditions (T&C), and you're not causing harm (such as a DDoS), then you're good to go!

3. Website Target Structure (4 mins)
I created a dedicated website for this workshop: https://trekky-reviews.com/. The site comes in several iterations, each fortified with progressively more challenging protections. Throughout the workshop, we'll manoeuvre through these defences.

4. Framework Installation and 1st Challenge (15 mins)
I will guide participants through the installation of the Scrapy framework and kick-start the first project.

5. Basic Challenge-Solving (15 mins)
Participants will solve 2 challenges:
- Bypass User-Agent filtering
- Add consistent HTTP headers

6. Proxies Overview (5 mins)
I explain the different types of proxy (datacenter, ISP, residential, and mobile), outlining their respective advantages and drawbacks.

7. Proxies Challenges (20 mins)
We'll set up Scrapoxy and configure the first connector. Participants will tackle 2 challenges:
- Bypass rate limits with datacenter proxies
- Avoid detection with ISP proxies

8. Headless Browser Challenge (20 mins)
Participants will install Playwright and tackle a series of challenges, including:
- Executing JavaScript with a headless browser
- Tuning headless browser parameters (like timezone)

9. Code Deobfuscation (10 mins)
I'll introduce techniques for deobfuscating both strings and control flow.

10. Deobfuscation Challenge (20 mins)
After installing Babel.js, participants will start reverse-engineering a protection through deobfuscation. They will replicate the anti-bot behaviour, including payload encryption.

11. Conclusion (3 mins)
As a wrap-up, I will present upcoming challenges and potential solutions, leaving us with food for thought about the future of protections.

I've previously spoken at Devoxx, PyCon, and other conferences. You can watch my latest recorded talk here: https://bit.ly/webscrapingvideopyconlt2024

I hope this submission meets your expectations for the conference!
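
As an illustration only, the "Add consistent HTTP headers" challenge from step 5 of the outline might look something like the Python sketch below. The header values and the helper names `build_request` and `headers_are_consistent` are assumptions made for this example, not code from the workshop repository. It uses the standard library rather than Scrapy so that it stays self-contained.

```python
from urllib.request import Request

# Hypothetical header set imitating desktop Chrome on Linux. The exact values
# are assumptions for illustration, not the ones used in the workshop.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Sec-Ch-Ua-Platform": '"Linux"',
}


def headers_are_consistent(headers: dict) -> bool:
    """Check that the platform client hint also appears in the User-Agent.

    A mismatch (for example a Windows hint with a Linux User-Agent) is the
    kind of inconsistency an anti-bot system could flag.
    """
    platform = headers.get("Sec-Ch-Ua-Platform", "").strip('"')
    return bool(platform) and platform in headers.get("User-Agent", "")


def build_request(url: str) -> Request:
    """Build (but do not send) a request carrying the browser-like headers."""
    if not headers_are_consistent(BROWSER_HEADERS):
        raise ValueError("User-Agent and platform hint disagree")
    return Request(url, headers=BROWSER_HEADERS)


if __name__ == "__main__":
    request = build_request("https://trekky-reviews.com/")
    # urllib stores header names capitalized, e.g. "User-agent".
    print(request.get_header("User-agent"))
```

In a Scrapy project the same idea would typically live in the project settings rather than in a helper function. The point is only that the User-Agent and the accompanying headers should describe the same browser and platform.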

Talk Level:
BEGINNER

Bio:
Fabien Vauchelles is the Anti-Ban Expert at Wiremind. He has over a decade of experience in web scraping, and his passion for code and technology helps him bypass bans. He is the creator of Scrapoxy, an open-source proxy aggregator for web scraping. He has spoken at JPrime, Devoxx FR, Zyte's Extract Summit and Voxxed Days. The presentation slides from one of his talks are attached here, along with a recording of his presentation at the 2023 Extract Summit (https://www.youtube.com/watch?v=nU-9P7rKdPo&t=12950s).