BeautifulSoup - I can't translate html tags that contain <a href=..</a> OR <em></em>

Melcu54 · (This post was last modified: Oct-25-2022, 05:26 PM by Melcu54.)

I can't translate HTML tags that contain other tags (such as <a href=..</a> or <em></em>).

In the example below, the paragraph is the problem: I cannot translate it. All other p classes are translated very well, except this class, because it contains <a href=..</a> or <em></em>.

I have tried so many things and I don't know why my code is not working. I don't get any error; this class is just not translated.

    <p class="JAGAAA">Intr-un articol precedent, <a href="https://neculaifantanaru.com/dupa-toate-regulile-artei.html"> <em>Dupa toate regulile artei</em> </a>, v-am povestit despre tanarul Hamlet, care voia sa razbune moartea tatalui sau</p>.

**THIS IS THE RELEVANT PART OF THE CODE**

    import os
    from bs4 import BeautifulSoup, NavigableString
    import re
    import textwrap
    from googletrans import Translator
    import pprint
    
    ...

    with open(f"{base_path}/{file}" , "r" , encoding='utf8', errors='ignore') as open_file:
      data = open_file.read()
    if data == "":
      print("{} este gol".format(file))
      continue
    lxml1 = str(BeautifulSoup(data, 'lxml'))
    #lxml1 = data
    lxml1 = lxml1.replace("\ufeff" , " ")
    #lxml1 = lxml1.replace("\n" , " ")
    #lxml1 = re.sub(' +', ' ', lxml1)
    if read_tags:
      soup = BeautifulSoup(data, 'lxml')
      # Look up each element whose text should be translated
      title_tag = soup.find("title")
      ist_p_tag = soup.find("p" , class_="text_obisnuit2")
      ist3_p_tag = soup.find("p" , class_="JAGAAA")
      second_p_tag = soup.find("p" , class_="donoo")
      meta_tag = soup.find("meta")
      if title_tag is None:
        print("Title tag not found")
      else:
        translated_title = translator.translate(title_tag.text, dest=input_lang)
        lxml1 = lxml1.replace(title_tag.text,translated_title.text)
      if meta_tag is None:
        print("meta tag not found")
      else:
        translated_meta = translator.translate(meta_tag["content"], dest=input_lang)
        lxml1 = lxml1.replace(meta_tag["content"],translated_meta.text)

      if ist_p_tag is None:
        print("<p class='text_obisnuit2' /> not found")
      else:
        translated_p = translator.translate(ist_p_tag.text, dest=input_lang)
        lxml1 = lxml1.replace(ist_p_tag.text,translated_p.text)

      if ist3_p_tag is None:
        print("<p class='JAGAAA' /> not found")
      else:
        translated_p = translator.translate(ist3_p_tag.text, dest=input_lang)
        lxml1 = lxml1.replace(ist3_p_tag.text,translated_p.text)

**Larz60+** · Oct-25-2022, 08:21 PM

Instead of:

    ist3_p_tag = soup.find("p" , class_="JAGAAA")

try:

    ist3_p_tag = soup.find("p" , {'class_': "JAGAAA"})

Melcu54 · (This post was last modified: Oct-26-2022, 03:31 PM by Melcu54.)

Hello, thanks, but it is not working. I now have it as you suggested:

      ist3_p_tag = soup.find("p" , {'class_': "JAGAAA"})

      if(ist3_p_tag == None):
        print("<p class='JAGAAA' /> not found")
      else:
        translated_p = translator.translate(ist3_p_tag .text, dest=input_lang)
        lxml1 = lxml1.replace(ist3_p_tag .text,translated_p.text)

It doesn't translate. The code skips this tag.

wavic · Oct-26-2022, 04:01 PM

What does ist3_p_tag contain?

Melcu54 · (This post was last modified: Oct-26-2022, 04:06 PM by Melcu54.)

It contains text with an <em> tag and an <a href=..</a> tag. That is why it doesn't work: because of these two nested tags.

<p class="JAGAAA">Intr-un articol precedent, <a href="https://neculaifantanaru.com/dupa-toate-regulile-artei.html"> <em>Dupa toate regulile artei</em> </a>, v-am povestit despre tanarul Hamlet, care voia sa razbune moartea tatalui sau.</p>

wavic · Oct-26-2022, 06:42 PM

What about ist3_p_tag.text?

I see that you are trying to get the text, but have you checked what it actually contains?

Melcu54 · (This post was last modified: Oct-26-2022, 07:12 PM by Melcu54.)

I tried all these options below, and it still doesn't work

    ist3_p_tag = soup.find("p" , {'class': "JAGAAA"})
    ist3_p_tag = soup.find('p', attr={'class_': 'JAGAAA'})
    ist3_p_tag = soup.find("p" , attr={'class_': "JAGAAA"})
    ist3_p_tag = soup.find_all("p", class_="JAGAAA")
    ist3_p_tag = soup.find("p" , {'class_': "JAGAAA"})
    ist3_p_tag.text = soup.find("p" , {'class_': "JAGAAA"})
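
For reference, here is a minimal sketch of how some of these variants behave on the sample paragraph. It assumes the built-in `html.parser` backend instead of `lxml` and a shortened `href`; neither comes from the thread. The point it illustrates is that the dictionary form needs the real attribute name `class`, while `class_` is only the keyword-argument spelling.

```python
from bs4 import BeautifulSoup

# Shortened copy of the paragraph from the thread (the href is abbreviated here)
html = (
    '<p class="JAGAAA">Intr-un articol precedent, '
    '<a href="x.html"> <em>Dupa toate regulile artei</em> </a>, '
    'v-am povestit despre tanarul Hamlet</p>'
)
soup = BeautifulSoup(html, "html.parser")

# Keyword form: class_ is used because "class" is a reserved word in Python
by_keyword = soup.find("p", class_="JAGAAA")

# Dictionary form: the key is the real HTML attribute name, "class"
by_dict = soup.find("p", {"class": "JAGAAA"})

# This searches for an attribute literally named "class_", which the tag lacks
wrong_key = soup.find("p", {"class_": "JAGAAA"})

# find_all returns a list-like ResultSet, not a single tag
many = soup.find_all("p", class_="JAGAAA")

print(by_keyword is not None, by_dict is not None, wrong_key is None)
print(len(many), hasattr(many, "text"))
```

So the first option in the list and the original `class_="JAGAAA"` keyword call both find the tag. The `find_all` result would need indexing (for example `many[0]`) before `.text` can be read. `attr=` is not the name of the `attrs` parameter, so it appears to be treated as a filter on an attribute called `attr` and matches nothing. The last line assigns to `.text` rather than looking anything up.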

wavic · (This post was last modified: Oct-27-2022, 05:01 AM by wavic.)

I am unable to reproduce what you are describing. It is still not clear what ist3_p_tag.text returns or contains.

Here is mine:

    >>> from bs4 import BeautifulSoup
    >>> html = """<p class="JAGAAA">Intr-un articol precedent, <a href="https://neculaifantanaru.com/dupa-toate-regulile-artei.html"> <em>Dupa toate regulile artei</em> </a>, v-am povestit despre tanarul Hamlet, care voia sa razbune moartea tatalui sau</p>"""
    >>> soup = BeautifulSoup(html, 'lxml')
    >>> soup
    <html><body><p class="JAGAAA">Intr-un articol precedent, <a href="https://neculaifantanaru.com/dupa-toate-regulile-artei.html"> <em>Dupa toate regulile artei</em> </a>, v-am povestit despre tanarul Hamlet, care voia sa razbune moartea tatalui sau</p></body></html>
    >>> p = soup.find('p', class_='JAGAAA')
    >>> p
    <p class="JAGAAA">Intr-un articol precedent, <a href="https://neculaifantanaru.com/dupa-toate-regulile-artei.html"> <em>Dupa toate regulile artei</em> </a>, v-am povestit despre tanarul Hamlet, care voia sa razbune moartea tatalui sau</p>
    >>> p.text
    'Intr-un articol precedent,  Dupa toate regulile artei , v-am povestit despre tanarul Hamlet, care voia sa razbune moartea tatalui sau'

As you can see, the text is in p.text regardless of the inline tags.
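
One possible explanation, which the thread itself does not confirm: the posted code does not modify the tag, it calls `lxml1.replace(ist3_p_tag.text, translated_p.text)` on the serialized HTML string. When a paragraph contains nested tags, its flattened `.text` no longer appears verbatim in that string, because the `<a>` and `<em>` markup sits in the middle of it. In that case `str.replace` finds nothing and silently returns the string unchanged. The sketch below illustrates this under the assumption that the built-in `html.parser` backend is used instead of `lxml`; the second, plain paragraph is made up for comparison.

```python
from bs4 import BeautifulSoup

html = (
    '<p class="JAGAAA">Intr-un articol precedent, '
    '<a href="https://neculaifantanaru.com/dupa-toate-regulile-artei.html"> '
    '<em>Dupa toate regulile artei</em> </a>, '
    'v-am povestit despre tanarul Hamlet</p>'
)
soup = BeautifulSoup(html, "html.parser")
p = soup.find("p", class_="JAGAAA")

serialized = str(soup)   # what the thread's code keeps in lxml1
flat_text = p.text       # text with the inline markup stripped out

# The flattened text is not a substring of the markup, so replace() is a no-op
found = flat_text in serialized
replaced = serialized.replace(flat_text, "TRANSLATED")
unchanged = replaced == serialized

# A paragraph without nested tags (hypothetical example) behaves differently
plain = BeautifulSoup('<p class="donoo">Un text simplu</p>', "html.parser")
plain_found = plain.find("p").text in str(plain)

print(found, unchanged, plain_found)
```

This would be consistent with the other paragraph classes translating fine while only the one with inline tags is skipped, and with no error being raised.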

Melcu54 · (This post was last modified: Oct-27-2022, 08:08 AM by Melcu54.)

as you can see, for each particular html or </a> in it

    to_p_tag = soup.find_all('p', class_='text_obisnuit')
    ist_p_tag = soup.find("p" , class_="text_obisnuit2")
    second_p_tag = soup.find("p" , class_="donoo")
    ist3_p_tag = soup.find("p" , class_="JAGAAA")

wavic · (This post was last modified: Oct-27-2022, 08:49 AM by wavic.)

It doesn't matter whether you call it ist3_p_tag or p as I did. How exactly does it not work?

If soup.find can't find "p" , class_="JAGAAA" it will return None and ist3_p_tag will be None.

In your code you are checking whether ist3_p_tag is None. Does it print "<p class='JAGAAA' /> not found" as it should?

If not, then ist3_p_tag is not None and ist3_p_tag = soup.find("p" , class_="JAGAAA") should be working.

Put

    print(ist3_p_tag.text)

at the end of your code to see what it contains.

If it contains all the text in the p tag, then the lookup works fine and you have to find out why the translation isn't working.

Look at my code. It is the same p tag, the class filter is used the same way as in yours, and soup.find does its job. The inline tags are not the problem.

Add that print as I suggested and see whether you are getting the text. If you are, the translation step is what is causing the "not working".
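
If the string `replace` on the serialized HTML turns out to be the culprit, one way around it is to translate the text nodes inside the tag and write them back into the tree, so that the `<a>` and `<em>` markup is left untouched. The following is only a sketch under stated assumptions: `translate_text` and `translate_tag_in_place` are hypothetical names, `translate_text` merely upper-cases its input as a stand-in for `translator.translate(text, dest=input_lang).text` so the example runs offline, and `html.parser` is used instead of `lxml`.

```python
from bs4 import BeautifulSoup, NavigableString


def translate_text(text):
    """Hypothetical stand-in for translator.translate(text, dest=...).text."""
    return text.upper()


def translate_tag_in_place(tag, translate):
    """Replace every non-blank text node under `tag` with its translation."""
    # find_all returns a list, so replacing nodes while looping is safe
    for node in tag.find_all(string=True):
        if node.strip():  # skip whitespace-only nodes between tags
            node.replace_with(NavigableString(translate(str(node))))


html = (
    '<p class="JAGAAA">Intr-un articol precedent, '
    '<a href="https://neculaifantanaru.com/dupa-toate-regulile-artei.html"> '
    '<em>Dupa toate regulile artei</em> </a>, '
    'v-am povestit despre tanarul Hamlet</p>'
)
soup = BeautifulSoup(html, "html.parser")
p = soup.find("p", class_="JAGAAA")

translate_tag_in_place(p, translate_text)
result = str(soup)  # serialize only after the tree has been modified
print(result)
```

The trade-off is that each fragment is translated on its own, so a real translator sees "Intr-un articol precedent, " without the rest of the sentence and may produce less natural output than it would for the whole paragraph.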
