LeetCode Scraper
A Python script to scrape question data from LeetCode
Hosted on GitHub: scrapeet
The Need For This Script
I have a habit of posting solutions to the LeetCode problems I solve on my GitHub repository. But I had to manually copy the question data from the LeetCode website & paste it into a file in a certain format that I follow.
The data I had to copy manually included:
- Question ID
- Question Title
- Question Difficulty
- Problem Statement
- Example Test Cases
- Constraints
So, I thought of automating this process by writing a script that would scrape the question data from the LeetCode website.
Note: This solves only the first part of my automation problem. The second part is to automate creating the solution file & copying the question data into it.
The Approach
My initial approach was to use Selenium to scrape the data from the URL of the LeetCode problem. But I hit a roadblock when I couldn't find a way to extract the data from the page source.
The only way I could think of was extracting the HTML elements directly, but the page source was dynamic & targeting the right divs & classes was a burden.
Then I came across a StackOverflow answer which suggested using a simple POST request to fetch the dynamic content of the page via the URL slug.
I'd like to thank that stranger.
So, the approach is to send a POST request, for which we can use the requests library in Python. This request will return a JSON response containing the requested data.
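To see the idea in isolation, here is a minimal sketch of such a request. It is trimmed down to a few fields; the full query used by the script appears later, and the endpoint and operation name are the same ones the script uses.

import requests

# Minimal sketch: POST a GraphQL query to LeetCode's endpoint, passing the
# problem's titleSlug as a variable. Only a handful of fields are requested here.
payload = {
    "operationName": "questionData",
    "variables": {"titleSlug": "two-sum"},
    "query": "query questionData($titleSlug: String!) { question(titleSlug: $titleSlug) { questionId title difficulty content } }",
}
r = requests.post("https://leetcode.com/graphql", json=payload).json()
print(r["data"]["question"]["title"])  # "Two Sum"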
Response JSON for https://leetcode.com/problems/two-sum/description/:
{
"data": {
"question": {
"questionId": "1",
...
"title": "Two Sum",
"titleSlug": "two-sum",
"content": "<p>Given an array of integers <code>nums</code> and an integer <code>target</code>, return <em>indices of the two numbers such that they add up to <code>target</code></em>.</p>\n\n<p>You may assume that each input would have <strong><em>exactly</em> one solution</strong>, and you may not use the <em>same</em> element twice.</p>\n\n<p>You can return the answer in any order.</p>\n\n<p> </p>\n<p><strong class=\"example\">Example 1:</strong></p>\n\n<pre>\n<strong>Input:</strong> nums = [2,7,11,15], target = 9\n<strong>Output:</strong> [0,1]\n<strong>Explanation:</strong> Because nums[0] + nums[1] == 9, we return [0, 1].\n</pre>\n\n<p><strong class=\"example\">Example 2:</strong></p>\n\n<pre>\n<strong>Input:</strong> nums = [3,2,4], target = 6\n<strong>Output:</strong> [1,2]\n</pre>\n\n<p><strong class=\"example\">Example 3:</strong></p>\n\n<pre>\n<strong>Input:</strong> nums = [3,3], target = 6\n<strong>Output:</strong> [0,1]\n</pre>\n\n<p> </p>\n<p><strong>Constraints:</strong></p>\n\n<ul>\n\t<li><code>2 <= nums.length <= 10<sup>4</sup></code></li>\n\t<li><code>-10<sup>9</sup> <= nums[i] <= 10<sup>9</sup></code></li>\n\t<li><code>-10<sup>9</sup> <= target <= 10<sup>9</sup></code></li>\n\t<li><strong>Only one valid answer exists.</strong></li>\n</ul>\n\n<p> </p>\n<strong>Follow-up: </strong>Can you come up with an algorithm that is less than <code>O(n<sup>2</sup>)</code><font face=\"monospace\"> </font>time complexity?",
...
"topicTags": [
{
"name": "Array",
"slug": "array",
"translatedName": null,
"__typename": "TopicTagNode"
},
{
"name": "Hash Table",
"slug": "hash-table",
"translatedName": null,
"__typename": "TopicTagNode"
}
],
...
"stats": "{\"totalAccepted\": \"13.6M\", \"totalSubmission\": \"25.7M\", \"totalAcceptedRaw\": 13641591, \"totalSubmissionRaw\": 25743266, \"acRate\": \"53.0%\"}",
...
"__typename": "QuestionNode"
}
}
}
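One thing worth noting: some fields in this response, such as stats, are themselves JSON-encoded strings, so they need a second round of decoding. A small sketch, assuming the response dict is stored in a variable named r as in the script below:

import json

# "stats" arrives as a JSON string inside the JSON response,
# so decode it separately to get a dict.
stats = json.loads(r["data"]["question"]["stats"])
print(stats["acRate"])  # e.g. "53.0%"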
The Script
Overview
This is the script I wrote to scrape the question data from the LeetCode website:
import requests
from bs4 import BeautifulSoup as bs


def scrapper(titleSlug: str):
    # GraphQL payload: the operation name, the titleSlug passed in as a variable,
    # and a query asking for everything LeetCode exposes about the question.
    data = {
        "operationName": "questionData",
        "variables": {"titleSlug": titleSlug},
        "query": (
            "query questionData($titleSlug: String!) "
            "{\n question(titleSlug: $titleSlug) "
            "{\n questionId\n questionFrontendId\n boundTopicId\n title\n titleSlug\n content\n translatedTitle\n translatedContent\n isPaidOnly\n difficulty\n likes\n dislikes\n isLiked\n similarQuestions\n contributors {\n username\n profileUrl\n avatarUrl\n __typename\n } "
            "\n langToValidPlayground\n topicTags {\n name\n slug\n translatedName\n __typename\n }\n companyTagStats\n codeSnippets {\n lang\n langSlug\n code\n __typename\n }\n stats\n hints\n solution {\n id\n canSeeDetail\n __typename\n } "
            "\n status\n sampleTestCase\n metaData\n judgerAvailable\n judgeType\n mysqlSchemas\n enableRunCode\n enableTestMode\n envInfo\n libraryUrl\n __typename\n } "
            "\n}\n"
        ),
    }

    # Send the query to LeetCode's GraphQL endpoint and decode the JSON response.
    r = requests.post("https://leetcode.com/graphql", json=data).json()

    # The problem statement ("content") comes back as HTML; parse it so the
    # plain text can be pulled out later with soup.get_text().
    soup = bs(r["data"]["question"]["content"], "html.parser")
    return r, soup
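A quick usage sketch, assuming the function above (the expected values match the Two Sum response shown earlier):

# Fetch the "two-sum" question and peek at a few fields.
response, soup = scrapper("two-sum")
print(response["data"]["question"]["questionId"])   # "1"
print(response["data"]["question"]["difficulty"])   # "Easy"
print(soup.get_text()[:60])                         # start of the plain-text statement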
Deep Dive Into The Script
- Import the necessary libraries:
  - requests: to send the POST request.
  - BeautifulSoup: to parse the HTML content.
- The dictionary containing the GraphQL query to be sent in the POST request:
data = { "operationName": "questionData", "variables": {"titleSlug": titleSlug}, "query": """query questionData($titleSlug: String!) { ... }""" }
The query requests various details about the problem, including its ID, title, content, difficulty, tags, code snippets, and more.
- Send the POST request:
r = requests.post("https://leetcode.com/graphql", json=data).json()
- Parse the HTML content of the problem statement. The content field comes back as raw HTML:
<p>Given an array of integers <code>nums</code> .... .... <font face="monospace"> </font>time complexity?
We care only about the plain text of the content. So, to remove the HTML tags and get the plain text, we use the BeautifulSoup library (see the short sketch after this list):
soup = bs(r["data"]["question"]["content"], "html.parser")
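To make that parsing step concrete, here is a tiny sketch using a made-up HTML fragment in place of the real content field:

from bs4 import BeautifulSoup as bs

# Hypothetical fragment of the "content" field: get_text() strips the tags
# and leaves only the human-readable text.
html = "<p>Given an array of integers <code>nums</code> and an integer <code>target</code> ...</p>"
soup = bs(html, "html.parser")
print(soup.get_text())  # Given an array of integers nums and an integer target ...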
Formatting This To Fit My Use Case
I don't freaking want the entirety of the JSON response. What should I do with companyTagStats, judgeType, mysqlSchemas, etc.?
I had a simple use case: get the questionId, title, difficulty & content, which I would then pass on to some other file.
So, I simply extracted the required fields from the JSON response one by one:
...
question_id = r["data"]["question"]["questionId"]
title = r["data"]["question"]["title"]
content = soup.get_text().replace(u'\xa0', u' ')
difficulty = r["data"]["question"]["difficulty"]
...
# Some more code to split content into problem statement, example test cases & constraints
return ...
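The splitting code itself isn't shown above. One possible way to do it (a sketch of my own, not the script's actual code) is to cut the plain text at the "Example 1:" and "Constraints:" markers that appear in the problem statement:

# Hypothetical splitting logic: slice the plain-text content at the markers
# that appear in the LeetCode problem statement.
example_start = content.find("Example 1:")
constraints_start = content.find("Constraints:")

problem_statement = content[:example_start].strip()
examples = content[example_start:constraints_start].strip()
constraints = content[constraints_start:].strip()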
What the heck is u'\xa0'? It is the Unicode non-breaking space character, which was used in the HTML content.
We don't want any unnecessary characters in our content (they could break the output at any time), so I replaced it with a normal space.
For Some Other Night
Stick around for the next part, where I'll automate the process of creating the solution file & copying the question data into it. I also maintain a README file that acts as a log of all my solutions; I'll automate updating that as well.