Python Job Crawler Architecture

While I was actively job hunting, searching for jobs online by hand was difficult because there were many thousands of listings. It occurred to me that I could automate the process. I chose Python for the task: it has various built-in modules that can be composed to get the job done, which I found much easier than in other languages.

Thus, I started off building the components as follows:

Job Class

A Job class holds the details of each job, such as title, description, company, contract type, and salary. This encapsulates the essentials of a job in a single instance.
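A minimal sketch of what such a class might look like, assuming a dataclass is used. The field names follow the details listed above; the `url` field and the defaults are my own assumptions, not the original code.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class Job:
    """Holds the essentials of one job posting."""
    title: str
    company: str
    description: str = ""
    contract_type: str = ""
    salary: Optional[str] = None  # kept as text, since portals format salaries differently
    url: str = ""                 # assumed field: link back to the posting


# Example instance with made-up values.
job = Job(title="Python Developer", company="Example Ltd", contract_type="permanent")
job_as_dict = asdict(job)  # plain dict, convenient for storage or display
```

A dataclass gives the constructor, equality, and a readable repr without extra code, which suits a class that mostly carries data.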

Target Portals

Then I had to think about the target job portals. I picked a list of about five but ended up coding four, since the task was more like “surgery” than ordinary coding: one small mistake and the whole application is broken. I added these to a Python list.

URL Composition

Once the target portals were selected, I had to compose the URLs. I discovered that a few sites used a similar URL pattern, which made the job easier. Nonetheless, it was still challenging to get the right combination of values. From separate lists of job types, contract roles, salaries, locations, and so on, I had to compose the exact URL that a user would produce when searching for similar jobs. When a URL failed, the program simply trapped the error and moved on to the next URL. Sorted!
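The composition step can be sketched roughly as follows. The portal addresses, the query parameter names (`q`, `location`, `contract`), and the helper names are all hypothetical; real portals each have their own URL scheme. The `fetch` argument is passed in so the error-trapping loop can be shown without touching the network.

```python
from itertools import product
from urllib.parse import urlencode

# Hypothetical portals: real sites each have their own path and parameter names.
PORTALS = [
    "https://jobs.example.com/search",
    "https://careers.example.org/find",
]
TITLES = ["Python Developer", "Java Developer"]
LOCATIONS = ["London"]


def build_search_urls(portals, titles, locations, contract="permanent"):
    """Return one search URL per (portal, title, location) combination."""
    urls = []
    for base, title, location in product(portals, titles, locations):
        query = urlencode({"q": title, "location": location, "contract": contract})
        urls.append(f"{base}?{query}")
    return urls


def fetch_all(urls, fetch):
    """Call fetch(url) for each URL; trap failures and move on to the next."""
    pages = {}
    for url in urls:
        try:
            pages[url] = fetch(url)
        except Exception as exc:  # broad on purpose: one bad URL must not stop the run
            print(f"skipping {url}: {exc}")
    return pages


urls = build_search_urls(PORTALS, TITLES, LOCATIONS)
```

Here `urlencode` takes care of escaping spaces and special characters, which is one of the easier places to get a hand-built URL wrong.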

Data Storage

The next challenge was data storage. This was easy, as SQLite was equal to the task and Python has a module for working with it. So I created a job_list.db database and connected to it from code. With the Job class, I could create an instance and ask the instance to store itself in the database.
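A rough sketch of this idea using the standard `sqlite3` module. The table layout, the `save` method name, and the `open_db` helper are assumptions for illustration; only the job_list.db file name comes from the text above.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class Job:
    title: str
    company: str
    url: str

    def save(self, conn):
        """Insert this job into the jobs table and return the new row id."""
        cur = conn.execute(
            "INSERT INTO jobs (title, company, url) VALUES (?, ?, ?)",
            (self.title, self.company, self.url),
        )
        conn.commit()
        return cur.lastrowid


def open_db(path="job_list.db"):
    """Connect to the database and create the (assumed) jobs table if missing."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        "id INTEGER PRIMARY KEY, title TEXT NOT NULL, company TEXT, url TEXT)"
    )
    return conn


conn = open_db(":memory:")  # in-memory for the demo; call open_db() to use job_list.db
row_id = Job("Python Developer", "Example Ltd", "https://jobs.example.com/1").save(conn)
```

The `?` placeholders let SQLite handle quoting, so a job title containing an apostrophe does not break the insert.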

Data Duplication

I tried to identify the id that each job portal used for its jobs. This was possible for about two sites, which kept the same id for each job. For the others, the id seemed to change from time to time. Nonetheless, I was able to retrieve non-duplicate job titles. I think the only cost was that some jobs were dropped, because the values I used to detect duplicates were sometimes shared by different jobs. Given the number of jobs retrieved, that loss had little impact.
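One way to express this kind of duplicate check is sketched below. It uses the portal's id when one is available and otherwise falls back to the normalised title and company. The dictionary keys and the choice of fallback fields are assumptions; the original program may have compared different values.

```python
def _normalise(text):
    """Lower-case and collapse whitespace so trivial differences do not matter."""
    return " ".join(text.lower().split())


def dedupe_key(job):
    """Prefer the portal's own id; otherwise fall back to title and company."""
    if job.get("portal_id"):
        return (job["portal"], job["portal_id"])
    return (_normalise(job["title"]), _normalise(job["company"]))


def drop_duplicates(jobs):
    """Keep the first job seen for each key, preserving the original order."""
    seen = set()
    unique = []
    for job in jobs:
        key = dedupe_key(job)
        if key not in seen:
            seen.add(key)
            unique.append(job)
    return unique


# Made-up sample: two postings share a portal id, two share title and company.
sample = [
    {"portal": "a", "portal_id": "17", "title": "Python Developer", "company": "Example Ltd"},
    {"portal": "a", "portal_id": "17", "title": "Python Developer (Remote)", "company": "Example Ltd"},
    {"portal": "b", "portal_id": None, "title": "Java  developer", "company": "Acme"},
    {"portal": "c", "portal_id": None, "title": "java developer", "company": "ACME"},
]
unique = drop_duplicates(sample)
```

The fallback key is deliberately coarse: two genuinely different openings with the same title at the same company collapse into one, which is the kind of loss described above.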

GUI Development

I then loaded the data into a GUI and displayed the jobs in a list control. Clicking the link on a job I liked took me to the page with the job description.
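A minimal sketch of such a list view, assuming tkinter as the toolkit (the post does not say which one was used) and assuming each job is a dictionary with `title`, `company`, and `url` keys. Double-clicking a row opens the posting in the default browser.

```python
import webbrowser


def row_label(job):
    """Text shown for one job in the list control."""
    return f"{job['title']} | {job['company']}"


def show_jobs(jobs):
    """Open a window listing the jobs; double-click opens the posting in a browser."""
    # Imported here so the module still loads on machines without a display.
    import tkinter as tk

    root = tk.Tk()
    root.title("Jobs")
    listbox = tk.Listbox(root, width=80, height=20)
    listbox.pack(fill=tk.BOTH, expand=True)
    for job in jobs:
        listbox.insert(tk.END, row_label(job))

    def open_selected(event):
        selection = listbox.curselection()
        if selection:
            webbrowser.open(jobs[selection[0]]["url"])

    listbox.bind("<Double-Button-1>", open_selected)
    root.mainloop()
```

Calling `show_jobs(jobs)` with a list of job dictionaries starts the window; the call is left out of the block so that the sketch can be imported without opening a GUI.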

Architecture

I have drawn out the architecture below, illustrating the interconnectivity between the components.

Outcome

The data recovered was quite substantial and would have enabled some data analytics, which I never carried out anyway. The job types sought were mainly development jobs such as Java Developer, Python Developer, Front-end Developer, C++ Developer, C#.NET Developer, JavaScript Developer, AWS, Azure, etc. At one point, the program could run for 20 minutes and retrieve about 5,000 job titles. When I widened the search to include temporary and part-time jobs, the number rose to 7,000. I was not keeping records at the time. But I achieved my aim: at the tap of a button, my program could go online and query the top job portals, retrieve all jobs that satisfied certain constraints including title, location, contract, and salary range, write them to a database, and then make them available in a GUI with a list control, so that I could scroll through and view them without having to browse for each one.

For the Future

The next plan, which I did not try, was to automate applications. With a program, I could apply to a hundred jobs in a second. Wouldn’t that be nice? But there were some challenges to face. I would have had to build custom cover letters based on the job description. Perhaps this would be easy if I prepared ready-made letters for each job role and selected a special letter only when certain values were identified in the description text.
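Since this part was never built, the following is only a sketch of how the letter selection might work. The keywords, the letter texts, and the `pick_letter` helper are all invented for illustration.

```python
import re
from string import Template

# Hypothetical ready-made letters, one per role keyword.
LETTERS = {
    "python": Template("Dear $company, I am applying for the $title role. I build tools in Python."),
    "java": Template("Dear $company, I am applying for the $title role. I build services in Java."),
}
DEFAULT = Template("Dear $company, I am applying for the $title role.")


def pick_letter(title, company, description):
    """Return the first letter whose keyword appears as a whole word, else the default."""
    text = f"{title} {description}".lower()
    for keyword, template in LETTERS.items():
        # Whole-word match, so "java" does not fire on "javascript".
        if re.search(rf"\b{re.escape(keyword)}\b", text):
            return template.substitute(title=title, company=company)
    return DEFAULT.substitute(title=title, company=company)


letter = pick_letter("Python Developer", "Example Ltd", "Build crawlers and data tools.")
```

Keyword matching like this is crude, so a real version would likely need more careful rules for descriptions that mention several technologies.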
