
Web Scraping with Jsoup in Java

Jsoup is a Java library used for extracting and manipulating data from HTML and XML documents. In this tutorial, you’ll learn how to perform web scraping using Jsoup step by step.
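
As a quick illustration of what the library does, here is a minimal, self-contained sketch (using a hard-coded HTML string rather than a live website) that parses markup in memory and reads out the title and body text:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupQuickStart {

    public static void main(String[] args) {
        // Parse an HTML string held in memory; no network request is made
        String html = "<html><head><title>Demo</title></head>"
                + "<body><p>Hello, Jsoup!</p></body></html>";
        Document doc = Jsoup.parse(html);

        System.out.println("Title: " + doc.title());           // prints: Title: Demo
        System.out.println("Body text: " + doc.body().text()); // prints: Body text: Hello, Jsoup!
    }
}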

Prerequisites

Before starting, ensure you have the following:

  • Java Development Kit (JDK) installed on your machine.
  • A Java IDE like Eclipse, IntelliJ IDEA, or NetBeans.
  • Basic understanding of Java programming language.

Step 1: Adding Jsoup to Your Project

  1. If you’re using Maven, add the Jsoup dependency to your pom.xml:
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.14.3</version>
</dependency>
  2. If you’re not using Maven, download the Jsoup JAR file from the official website and add it to your project’s classpath.

Step 2: Creating a Java Application

  1. Open your Java IDE.
  2. Create a new Java project.

Step 3: Writing Web Scraping Code

Now, let’s write Java code to scrape data from a website.

  1. Import the required classes in your Java class.
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
  2. Write code to connect to a website and parse its HTML content.
public class WebScraper {

    public static void main(String[] args) {
        try {
            // Connect to the website
            String url = "https://example.com";
            Document doc = Jsoup.connect(url).get();

            // Extract data
            Elements elements = doc.select("CSS_SELECTOR");

            // Process extracted data
            for (Element element : elements) {
                // Perform operations on each element, e.g. element.text() or element.attr("href")
            }
        } catch (IOException e) {
            // Jsoup.connect(url).get() throws IOException on network or HTTP failures
            e.printStackTrace();
        }
    }
}

Replace "CSS_SELECTOR" with the appropriate CSS selector to target the elements you want to scrape. You can find CSS selectors using browser developer tools.
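
For instance, a few common selector patterns look like the following; the class and id names used here are illustrative and must be adapted to the markup of the page you are scraping:

// Illustrative selector patterns (adjust the names to the target page's markup)
Elements headings = doc.select("h2.title");          // <h2> elements with class "title"
Elements secureLinks = doc.select("a[href^=https]"); // links whose href starts with "https"
Elements contentParas = doc.select("#content p");    // <p> elements inside the element with id "content"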

Step 4: Extracting Data

In this step, you’ll extract data from the HTML document using Jsoup.

  1. Identify the elements you want to scrape using CSS selectors.
  2. Use Jsoup’s select() method to select those elements.
  3. Iterate through the selected elements and extract the desired data.
// Extracting data
Elements titles = doc.select("h2.title");
for (Element title : titles) {
    System.out.println("Title: " + title.text());
}

Elements links = doc.select("a[href]");
for (Element link : links) {
    System.out.println("Link: " + link.attr("href"));
}
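
If the page uses relative links, Jsoup can resolve them against the document’s base URL via the abs: attribute prefix. This is a small optional variation on the loop above:

// abs:href resolves each relative href against the document's base URI
for (Element link : doc.select("a[href]")) {
    System.out.println("Absolute link: " + link.attr("abs:href"));
}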

Step 5: Running the Application

  1. Build your project.
  2. Run the Java application.

You should see the scraped data printed on the console.

Conclusion

Jsoup provides powerful features for web scraping, allowing you to extract data from HTML and XML documents with ease. Remember to scrape responsibly and follow the terms of service of the websites you target. Explore further to discover more advanced techniques and functionalities offered by Jsoup.
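
As a starting point for those advanced options, the sketch below (a minimal example with placeholder values) configures the connection with a custom user agent and a request timeout before fetching the page:

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ConfiguredScraper {

    public static void main(String[] args) {
        try {
            Document doc = Jsoup.connect("https://example.com")
                    .userAgent("Mozilla/5.0 (compatible; MyScraper/1.0)") // identify your client politely
                    .timeout(10_000)                                      // give up after 10 seconds
                    .get();

            System.out.println("Page title: " + doc.title());
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}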
