Is there a way in PySpark to search for a string made up of two separate words?

Asked: 2019-04-17 00:41:17

Tags: python apache-spark pyspark

I'm looking for a way in Python Spark to search for a string made of two separate words, for example: Iphone x or Samsung s10 ...

I want to give it a text file and a composite string such as (Iphone x), for example, and then get the matching results.

All I can find on the internet is single-word counting.

3 Answers:

Answer 0 (score: 0)

IIUC:

In Spark 2.0+, if you are going to read it from a file, for example a .csv file:

df = spark.read.format("csv").option("header", "true").load("pathtoyourcsvfile.csv")

then you can filter it using regex like this:

pattern = "\s+(word1|word2)\s+"
filtered = df.filter(df['<thedesiredcolumnhere>'].rlike(pattern))
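
That filters rows containing either word1 or word2. Since the question asks for a two-word phrase, a minimal sketch that treats the whole phrase as one pattern could look like this (the column name text and the sample rows are assumptions):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# hypothetical sample data with a single "text" column
df = spark.createDataFrame(
    [("I just bought an Iphone x yesterday",),
     ("the Samsung s10 is on sale",),
     ("no phones mentioned here",)],
    ["text"],
)

# case-insensitive match for the two-word phrase, allowing any whitespace between the words
pattern = r"(?i)\biphone\s+x\b"

filtered = df.filter(df["text"].rlike(pattern))
filtered.show(truncate=False)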

Answer 1 (score: 0)

You can try to write your own UDF combined with wordsegment to segment your words, and you can add new words to the dictionary to help the library segment new terms such as "Iphone x" (a sketch of such a UDF follows the example below).

For example:

>>> from wordsegment import load, clean, segment
>>> load()
>>> clean('She said, "Python rocks!"')
'shesaidpythonrocks'
>>> segment('She said, "Python rocks!"')
['she', 'said', 'python', 'rocks']
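
A minimal sketch of such a UDF is shown below. The column name raw and the sample rows are assumptions, and wordsegment must be installed on every executor; loading the dictionary inside the function keeps the sketch simple but is slow, so a real job should cache it.

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

spark = SparkSession.builder.getOrCreate()

# hypothetical sample data: one concatenated string per row
df = spark.createDataFrame([("ijustboughtaniphonex",), ("samsungs10onsale",)], ["raw"])

def segment_text(s):
    # import and load inside the function so the dictionary is available on the
    # executors as well as the driver (loading per call is slow; cache it in practice)
    from wordsegment import load, segment
    load()
    return segment(s) if s else []

segment_udf = udf(segment_text, ArrayType(StringType()))

df.withColumn("words", segment_udf("raw")).show(truncate=False)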

If you don't want to use a library, you can also look at Word segmentation using dynamic programming; a minimal sketch of that idea follows.
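
This is only an illustrative sketch of the dynamic-programming idea, not the linked answer's code; the tiny vocabulary and the function name segment_dp are assumptions.

def segment_dp(text, vocab, max_word_len=20):
    """Split a string with no spaces into words, preferring dictionary words.
    dp[i] holds (cost, words) for the best split of text[:i], where cost is
    the number of characters not covered by a dictionary word."""
    n = len(text)
    dp = [(0, [])] + [None] * n
    for i in range(1, n + 1):
        # fallback: treat text[i-1] as a single unknown character
        best = (dp[i - 1][0] + 1, dp[i - 1][1] + [text[i - 1]])
        for j in range(max(0, i - max_word_len), i):
            word = text[j:i]
            if word in vocab:
                candidate = (dp[j][0], dp[j][1] + [word])
                if candidate[0] < best[0]:
                    best = candidate
        dp[i] = best
    return dp[n][1]

# example with a tiny hand-made vocabulary
vocab = {"iphone", "x", "she", "said", "python", "rocks"}
print(segment_dp("shesaidpythonrocks", vocab))  # ['she', 'said', 'python', 'rocks']
print(segment_dp("iphonex", vocab))             # ['iphone', 'x']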

Answer 2 (score: 0)

This is the answer:

import re

# give a file
rdd = sc.textFile("/root/PycharmProjects/Spark/file")

# give a composite string
string_ = "Iphone x"

# filter to lines containing the string
new_rdd = rdd.filter(lambda line: string_ in line)

# collect these lines
rt = str(new_rdd.collect())

# apply regex to find all matches and count them
count = len(re.findall(string_, rt))
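
A small variation on this, assuming the same file path and search string, keeps the counting on the executors instead of collecting every matching line back to the driver:

import re
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

string_ = "Iphone x"

# count occurrences per line on the executors, then sum the per-line counts
count = (
    sc.textFile("/root/PycharmProjects/Spark/file")
      .map(lambda line: len(re.findall(re.escape(string_), line, flags=re.IGNORECASE)))
      .sum()
)
print(count)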