[FIXED] How to handle  (object replacement character) in URL

Issue

I am using Jsoup to scrape URLs, and one of the URLs I keep getting has this  symbol in it. I have tried decoding the URL:

url = URLDecoder.decode(url, "UTF-8" );

but the symbol still remains in the decoded URL.

I can't find much online about this, other than that it is "The object replacement character, sometimes used to represent an embedded object in a document when it is converted to plain text."

But if that is the case, I should be able to print the symbol as plain text. However, when I run

System.out.println("");

I get a compilation error, and the file reverts back to the last save.
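
If the compilation error comes from the source file's encoding (an assumption, since the error message itself is not shown), one way to sidestep it is to write the character as a Unicode escape so the source stays pure ASCII. A minimal sketch, assuming the symbol is U+FFFC, which is what the bytes %ef%bf%bc represent in UTF-8; the class and method names are only illustrative:

```java
class ObjectReplacementDemo {
    // U+FFFC written as an escape, so the source file contains only ASCII.
    static final String OBJECT_REPLACEMENT = "\uFFFC";

    // Formats a code point in the usual U+XXXX notation.
    static String describe(int codePoint) {
        return String.format("U+%04X", codePoint);
    }

    public static void main(String[] args) {
        // Prints the code point instead of the (invisible) character itself.
        System.out.println(describe(OBJECT_REPLACEMENT.codePointAt(0)));
    }
}
```

Printing the code point rather than the raw character also avoids depending on whether the console can render it.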

Sample URL: https://www.breightgroup.com/job/hse-advisor-embedded-contract-roles%ef%bf%bc/

NOTE: If you decode the URL and then compare it to the URL as it is displayed, the comparison reports that they are not the same, e.g.:

        String url = URLDecoder.decode("https://www.breightgroup.com/job/hse-advisor-embedded-contract-roles%ef%bf%bc/", "UTF-8");
        if(url.contains("https://www.breightgroup.com/job/hse-advisor-embedded-contract-roles?/")){
            System.out.println("The same");
        }else {
            System.out.println("Not the same");
        }
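
For comparison, here is a sketch of the same check with the symbol written as an escape instead of a literal ?. It assumes the decoded character is U+FFFC; the class name, constants, and helper method are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

class DecodedUrlCheck {
    static final String ENCODED =
            "https://www.breightgroup.com/job/hse-advisor-embedded-contract-roles%ef%bf%bc/";

    // The expected decoded form, with U+FFFC written as an escape.
    static final String EXPECTED =
            "https://www.breightgroup.com/job/hse-advisor-embedded-contract-roles\uFFFC/";

    // Decodes a percent-encoded URL as UTF-8.
    static String decode(String url) {
        try {
            return URLDecoder.decode(url, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a required charset on every Java runtime.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String decoded = decode(ENCODED);
        System.out.println(decoded.equals(EXPECTED) ? "The same" : "Not the same");
    }
}
```

With the escape in place the two strings should compare as equal, which suggests the mismatch comes from the literal ? standing in for the symbol rather than from the decoding step.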

Solution

I resolved the issue by simply skipping URLs that contain this symbol, because there are other URLs with invisible Unicode symbols that couldn't be converted either.

So I compare each URL against the following regex, and if it returns false I bypass that URL. Hope this helps someone out:

boolean newURL = url.matches("^[a-zA-Z0-9_:;/.&|%!+=@?-]*$");
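
A sketch of how this check could be applied to a list of scraped URLs. Only the regular expression comes from the answer; the class name, method names, and the example.com URL are illustrative assumptions. Note that the pattern allows %, so a URL that is still percent-encoded passes; the symbol is rejected only once the URL has been decoded.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class UrlFilter {
    // Same character whitelist as the regex above, compiled once for reuse.
    private static final Pattern SAFE_URL =
            Pattern.compile("^[a-zA-Z0-9_:;/.&|%!+=@?-]*$");

    // True when the URL contains only whitelisted characters.
    static boolean isSafe(String url) {
        return SAFE_URL.matcher(url).matches();
    }

    // Returns the URLs that pass the check, preserving their order.
    static List<String> keepSafe(List<String> urls) {
        List<String> kept = new ArrayList<>();
        for (String url : urls) {
            if (isSafe(url)) {
                kept.add(url);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> scraped = Arrays.asList(
                // Decoded URL containing U+FFFC, written as an escape.
                "https://www.breightgroup.com/job/hse-advisor-embedded-contract-roles\uFFFC/",
                // Hypothetical URL made up for this example.
                "https://example.com/jobs?id=1");
        System.out.println(keepSafe(scraped));
    }
}
```

Compiling the pattern once avoids recompiling it on every call, which is what String.matches does internally.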

Answered By – HelloWorld

Answer Checked By – Mary Flores (Easybugfix Volunteer)
